1. Load and Review the Dataset

1.1 Data Loading

# Check R Version to avoid issues
if (as.numeric(R.version$major) < 4 ||
    (as.numeric(R.version$major) == 4 && as.numeric(R.version$minor) < 3) ||
    (as.numeric(R.version$major) == 4 && as.numeric(R.version$minor) == 3 && as.numeric(R.version$patch) < 2)) {
  message("Warning: Your R version is not 4.3.2 or greater.")
  message("Please consider updating R to the latest version from:\nhttps://cran.r-project.org/")
}

# List of required packages
required_packages <- c(
  "plyr", "alr4", "caret", "car", "corrplot", "dplyr", "effects", "fastDummies", "ggplot2",
  "GGally", "ggplot2", "ggpubr", "glmnet", "lmtest", "MASS", "ModelMetrics", "kableExtra",
  "nortest", "olsrr", "onewaytests", "readr", "here", "stringr", "knitr", "reshape2", "leaflet",
  "RColorBrewer", "scales", "purrr", "DT", "jsonlite", "magrittr", "rpart", "broom", "neuralnet",
  "pscl"
)

# Establish CRAN for package installs
options(repos = c(CRAN = "https://ftp.osuosl.org/pub/cran/")) # Set the CRAN mirror

# Check if each package is installed; if not, install it
for (pkg in required_packages) {
  if (!(pkg %in% installed.packages()[,"Package"])) {
    install.packages(pkg, dependencies = TRUE)
  }
}

# Load all the packages without displaying masking warnings
lapply(required_packages, function(pkg) {
  suppressMessages(library(pkg, character.only = TRUE))
})

# Build the full path to the directory containing the Rmd file
rmd_dir <- dirname(here())

# Navigate up one directory and then to the CSV data file
csv_file <- file.path("KC_House_Sales.csv")

json_filepath <- file.path("model_parameters.json")

# Read the CSV file into a data frame
df <- read.csv(csv_file)

# Function to manually update the JSON file with model parameters
update_model_json <- function(model_name, features, filepath) {
  # Read existing parameters if the file exists, or initialize an empty list
  model_params <- if (file.exists(filepath)) {
                    fromJSON(filepath)
                  } else {
                    list()
                  }
  # Update the parameters for the specified model
  model_params[[model_name]] <- features
  # Write the updated parameters back to the JSON file
  write_json(model_params, filepath)
}

# Initialize the figure counter
fig_counter <- 0
table_counter <- 0

# Custom function to generate figure captions with automatic numbering
generate_figure_caption <- function(caption, section) {
  fig_counter <<- fig_counter + 1
  paste0("Figure ", section, ".", fig_counter, " ", caption)
}

# Custom function to generate figure captions with automatic numbering
generate_table_caption <- function(caption, section) {
  table_counter <<- table_counter + 1
  paste0("Table ", section, ".", table_counter, " ", caption)
}

1.2 Initial Data Inspection

Dataset Overview and Detailed Description

The King County house sales dataset is a comprehensive collection of 21,613 observations, each representing a unique house sale. The dataset encompasses a variety of features that describe different aspects of the houses sold. Below is a detailed description of each variable in the dataset:

# Create a data frame for the table
data_description <- data.frame(
  Variable = c(
    'id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
    'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above',
    'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat',
    'long', 'sqft_living15', 'sqft_lot15'
  ),
  Description = c(
    'Unique ID for each home sold (not used as a predictor)',
    'Date of the home sale',
    'Price of each home sold',
    'Number of bedrooms',
    'Number of bathrooms, ".5" accounts for a bathroom with a toilet but no shower',
    'Square footage of the apartment interior living space',
    'Square footage of the land space',
    'Number of floors',
    'A dummy variable for whether the apartment was overlooking the waterfront or not',
    'An index from 0 to 4 of how good the view of the property was',
    'An index from 1 to 5 on the condition of the apartment',
    'An index from 1 to 13 about building construction and design quality',
    'The square footage of the interior housing space above ground level',
    'The square footage of the interior housing space below ground level',
    'The year the house was initially built',
    'The year of the house’s last renovation',
    'The zipcode area the house is in',
    'Latitude coordinate',
    'Longitude coordinate',
    'The square footage of interior housing living space for the nearest 15 neighbors',
    'The square footage of the land lots of the nearest 15 neighbors'
  )
)

# Create the table with kable
data_description_table <- kable(
  data_description,
  format = "html",
  caption = generate_figure_caption(caption = "Data Description", section = 1)
) %>%
  kable_styling(full_width = TRUE) %>%
  column_spec(1, bold = TRUE)

# Print the table
data_description_table
Figure 1.1 Data Description
Variable Description
id Unique ID for each home sold (not used as a predictor)
date Date of the home sale
price Price of each home sold
bedrooms Number of bedrooms
bathrooms Number of bathrooms, “.5” accounts for a bathroom with a toilet but no shower
sqft_living Square footage of the apartment interior living space
sqft_lot Square footage of the land space
floors Number of floors
waterfront A dummy variable for whether the apartment was overlooking the waterfront or not
view An index from 0 to 4 of how good the view of the property was
condition An index from 1 to 5 on the condition of the apartment
grade An index from 1 to 13 about building construction and design quality
sqft_above The square footage of the interior housing space above ground level
sqft_basement The square footage of the interior housing space below ground level
yr_built The year the house was initially built
yr_renovated The year of the house’s last renovation
zipcode The zipcode area the house is in
lat Latitude coordinate
long Longitude coordinate
sqft_living15 The square footage of interior housing living space for the nearest 15 neighbors
sqft_lot15 The square footage of the land lots of the nearest 15 neighbors

1.3 Data Summary

The dataset contains housing information for a total of 21,613 houses. The prices of these houses range from the minimum price of $0 to a maximum of $9.9 million. On average, the houses in this dataset have a price of approximately $540,000. The median price, which represents the middle value when all prices are arranged in ascending order, is $450,000. The most common price range falls within the first quartile, where houses have prices around $321,000 to $645,000. The dataset also includes information on various other factors, such as the number of bedrooms, bathrooms, square footage of living space, lot size, and more, all of which can impact house prices. Understanding the distribution and characteristics of house prices in this dataset is essential for any analysis or modeling task related to real estate.


2. Create Train and Test Datasets

# Data Preprocessing and Transformation
set.seed(1023)  # Setting a seed for reproducibility
split_index <- sample(1:nrow(df), size = 0.7 * nrow(df))
train_df <- df[split_index, ]
test_df <- df[-split_index, ]

# Remove non-numeric characters from the 'price' column and convert it to numeric
df$price <- as.numeric(str_replace_all(df$price, "[^0-9.]", ""))

# Calculation of Convergence Point: Determine the convergence point for high-value homes
high_value_threshold <- quantile(df$price, probs = 0.95, na.rm = TRUE)  # Calculate the high-value threshold
high_value_homes <- df[df$price >= high_value_threshold, ]  # Select high-value homes
convergence_point <- c(mean(high_value_homes$lat, na.rm = TRUE), mean(high_value_homes$long, na.rm = TRUE))

# Remove non-numeric characters from the 'price' column and convert it to numeric
train_df$price <- as.numeric(str_replace_all(train_df$price, "[^0-9.]", ""))
test_df$price <- as.numeric(str_replace_all(test_df$price, "[^0-9.]", ""))

# Data Transformation Function with Distance Binning Option
transform_data <- function(df, convergence_point, linear_model) {
  # Date Transformation: Convert the 'date' column to a Date object if present
  if ("date" %in% colnames(df)) {
    df$date <- as.Date(substr(as.character(df$date), 1, 8), format="%Y%m%d")
    # Date-Time Feature Engineering: Extract various date-related features
    df$year_sold <- lubridate::year(df$date)
    df$month_sold <- lubridate::month(df$date)
    df$day_sold <- lubridate::day(df$date)
    df$season <- factor(lubridate::quarter(df$date), labels = c("Winter", "Spring", "Summer", "Fall"))
    df$week_of_year <- lubridate::week(df$date)
    df$day_of_year <- lubridate::yday(df$date)
  }
  # Creating Dummy Variables: Convert categorical variables into dummy variables
  df <- df %>%
    mutate(zipcode = as.factor(zipcode),
           waterfront = as.factor(waterfront),
           view = as.factor(view),
           condition = as.factor(condition),
           grade = as.numeric(grade),
           grade = case_when(
             grade %in% 1:3 ~ "Below_Average",
             grade %in% 4:10 ~ "Average",
             grade %in% 11:13 ~ "Above_Average")) %>%
    dummy_cols(select_columns = c('zipcode', 'view', 'condition', 'grade', 'waterfront', 'season'))
  # Remove last dummy variables to avoid multicollinearity
  if (linear_model) {
    df <- df[, !(names(df) %in% c("zipcode_98199", "view_0", "condition_1", "grade_13", "season_Winter", "waterfront_1"))]
  }
  # Haversine Distance Function: Calculate the distance between two points on Earth's surface
  haversine_distance <- function(lat1, long1, lat2, long2) {
    R <- 6371  # Earth radius in kilometers
    delta_lat <- (lat2 - lat1) * pi / 180
    delta_long <- (long2 - long1) * pi / 180
    a <- sin(delta_lat/2)^2 + cos(lat1 * pi / 180) * cos(lat2 * pi / 180) * sin(delta_long/2)^2
    c <- 2 * atan2(sqrt(a), sqrt(1 - a))
    d <- R * c  # Calculate the haversine distance
    return(d)
  }
  # Calculate Haversine Distance
  df$distance_to_convergence <- mapply(haversine_distance, df$lat, df$long,
                                       MoreArgs = list(lat2 = convergence_point[1], long2 = convergence_point[2]))
  # Remove columns that are no longer needed
  df <- df[, !(names(df) %in% c("id", "date", "zipcode", "view", "condition", "grade", "waterfront", "season"))]
  return(df)
}
# Applying the transformation function to training and test sets
train_df_linear <- transform_data(train_df, convergence_point, linear_model = TRUE)  # Transform the training data for linear models
test_df_linear <- transform_data(test_df, convergence_point, linear_model = TRUE)    # Transform the test data for linear models
train_df_non_linear <- transform_data(train_df, convergence_point, linear_model = FALSE)  # Transform the training data
test_df_non_linear <- transform_data(test_df, convergence_point, linear_model = FALSE)    # Transform the test data

# Set this to TRUE to update all the json model_parameters that are stored the JSON
# Check if the update_model_parameters is TRUE or not
update_model_parameters <- FALSE

# This updates the json with the parameters that were obtained from the intensive process of running
update_model_json <- function(model_name, features, filepath) {
  model_params <- if (file.exists(filepath)) {
                    fromJSON(filepath)
                  } else {
                    list()
                  }
  model_params[[model_name]] <- features
  write_json(model_params, filepath)
}

2.1 Data Cleaning and Transformation

The data preprocessing and transformation phase was crucial to prepare the dataset for accurate predictive analysis. This section outlines the key steps taken:

2.1.1 Exclusion of Non-Predictive Variables

Exclusion of Non-Predictive Variables: The dataset contained certain variables that were non-predictive in nature and therefore not useful for our regression model. Specifically, the id variable, serving as a unique identifier for each house sale, was removed to prevent it from influencing house price predictions. However, lat (latitude) and long (longitude) were retained for their potential role in calculating geographical distances, which could impact house prices.

2.1.2 Transformation of Data Types

Transformation of Data Types: To ensure consistency and suitability for modeling, several variables underwent data type transformation. Notably, the date variable, initially in string format, was converted into a numeric format to facilitate its incorporation into statistical models. Additionally, variables like price, sqft_living, sqft_lot, and others were converted to numeric formats.

2.1.3 Creation of Dummy Variables for Categorical Data

Creation of Dummy Variables for Categorical Data: Categorical variables such as waterfront, view, condition, and grade were transformed into dummy variables. This transformation was essential for regression analysis, as it enabled the inclusion of non-numeric variables in the model. The process involved converting these categorical variables into binary variables (0 or 1). This was particularly important for variables like waterfront, which is inherently binary, and for ordinal variables like view and condition, which possess an intrinsic order but needed numerical representation for modeling.

2.1.4 Handling Special Cases in Variables

Handling Special Cases in Variables: Variables like bathrooms, which could have values like “0.5” to represent bathrooms with a toilet but no shower, were retained in their original form. These nuanced representations were preserved, as they carried important information about the characteristics of the houses.

2.1.5 Grouping and Clustering of Variables

Grouping and Clustering of Variables: The zipcode variable underwent transformation by extracting the first three digits. This step reduced the number of dummy variables, preventing model complexity while still capturing geographical influences on house prices. Additionally, the grade variable was clustered into broader categories to simplify the model and focus on significant differences in construction and design quality.

2.1.6 Haversine Distance Calculation

Haversine Distance Calculation: A critical step was the calculation of Haversine distances. This involved creating a function to calculate the distance between two geographical points represented by latitude and longitude coordinates. The calculated haversine_distance was pivotal for understanding spatial relationships and proximity to key locations that might affect house prices.

2.1.7 Calculation of Convergence Point

Calculation of Convergence Point: A ‘convergence point’ was identified within the dataset, derived from houses with the highest values. This convergence point served as a reference to calculate the distance of each property from this central high-value location, potentially indicating a desirable area. Importantly, this step was executed on the training set alone to ensure the model accounted for locational desirability without data leakage.

2.2 Training Data Header

Table 2.1 Data Header
price bedrooms bathrooms sqft_living sqft_lot floors sqft_above sqft_basement yr_built yr_renovated lat long sqft_living15 sqft_lot15 year_sold month_sold day_sold week_of_year day_of_year zipcode_98001 zipcode_98002 zipcode_98003 zipcode_98004 zipcode_98005 zipcode_98006 zipcode_98007 zipcode_98008 zipcode_98010 zipcode_98011 zipcode_98014 zipcode_98019 zipcode_98022 zipcode_98023 zipcode_98024 zipcode_98027 zipcode_98028 zipcode_98029 zipcode_98030 zipcode_98031 zipcode_98032 zipcode_98033 zipcode_98034 zipcode_98038 zipcode_98039 zipcode_98040 zipcode_98042 zipcode_98045 zipcode_98052 zipcode_98053 zipcode_98055 zipcode_98056 zipcode_98058 zipcode_98059 zipcode_98065 zipcode_98070 zipcode_98072 zipcode_98074 zipcode_98075 zipcode_98077 zipcode_98092 zipcode_98102 zipcode_98103 zipcode_98105 zipcode_98106 zipcode_98107 zipcode_98108 zipcode_98109 zipcode_98112 zipcode_98115 zipcode_98116 zipcode_98117 zipcode_98118 zipcode_98119 zipcode_98122 zipcode_98125 zipcode_98126 zipcode_98133 zipcode_98136 zipcode_98144 zipcode_98146 zipcode_98148 zipcode_98155 zipcode_98166 zipcode_98168 zipcode_98177 zipcode_98178 zipcode_98188 zipcode_98198 view_1 view_2 view_3 view_4 condition_2 condition_3 condition_4 condition_5 grade_Above_Average grade_Average grade_Below_Average waterfront_0 season_Spring season_Summer season_Fall distance_to_convergence
890000 5 1.00 2590 4652 2 2310 280 1907 0 47.6038 -122.294 2360 4650 2014 10 9 41 282 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 5.094525
355000 3 1.75 1960 7705 1 980 980 1950 0 47.5300 -122.347 1380 4349 2014 9 23 38 266 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 13.117306
812500 4 2.75 2810 10300 1 1810 1000 1978 0 47.5626 -122.149 2710 9900 2014 8 8 32 220 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 8.516639
290000 4 2.50 2000 13300 1 1200 800 1968 0 47.3530 -122.294 1800 9810 2014 5 9 19 129 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 29.761115
410000 3 1.75 1660 6250 1 830 830 1980 0 47.5859 -122.385 1660 5750 2014 7 9 28 190 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 1 0 12.202504
496000 3 2.50 2180 4533 2 2180 0 2010 0 47.7540 -122.215 2180 7347 2014 11 17 46 321 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 15.268135

3. Data Preprocessing

3.1 Exclusion of Non-Predictive Variables

3.1.1 Exclusion of id Variable

  • The id variable, serving as a unique identifier for each house sale, was removed from the dataset. This exclusion aimed to prevent it from influencing house price predictions.

3.1.2 Retention of Geographic Variables

  • Unlike other non-predictive variables, lat (latitude) and long (longitude) were retained in the dataset. These geographic coordinates were preserved due to their potential role in calculating geographical distances, which could significantly impact house prices. While not directly predictive, they provide valuable spatial information.

3.2 Transformation of Data Types

3.2.1 Conversion of date Variable

  • The date variable, initially in string format, underwent a critical transformation. It was converted into a numeric format, allowing for easier incorporation into statistical models. Numeric representations of dates are more amenable to various types of analyses, including regression.

3.2.2 Conversion of Numeric Variables

  • Several variables, including price, sqft_living, sqft_lot, and others, underwent data type transformation. These variables were converted to numeric formats to ensure consistency and suitability for modeling purposes. Numeric representations are essential for performing mathematical operations and statistical modeling.

3.3 Creation of Dummy Variables for Categorical Data

3.3.1 Transformation of Categorical Variables

  • Categorical variables such as waterfront, view, condition, and grade were transformed into dummy variables. This transformation is pivotal for regression analysis, as it allows these non-numeric variables to be effectively included in the model.

3.3.2 Binary Representation

  • The process of creating dummy variables involved converting categorical variables into a series of binary variables (0 or 1). This binary representation is particularly important for variables like waterfront, which is a binary indicator itself, and for ordinal variables like view and condition. This transformation preserves the inherent order of these variables while making them suitable for numerical modeling.

3.4 Handling Special Cases in Variables

3.4.1 Treatment of bathrooms Variable

  • The bathrooms variable presented a unique challenge. It contains values like "0.5" that represent bathrooms with a toilet but no shower. In the code block, you can observe that we made a conscious decision not to apply any transformation to this variable. By doing so, we retained the nuanced details within the data. For instance, these values indicate specific house characteristics that can influence its price, such as the presence of a half-bathroom.

3.5 Grouping and Clustering of Variables

3.5.1 Transformation of zipcode Variable

  • In the prior code block, we demonstrate the transformation of the zipcode variable. This transformation involves extracting the first three digits of the zip code. The purpose of this transformation is twofold: it helps in reducing the number of dummy variables, preventing the model from becoming overly complex, while still retaining the essential geographical information. It allows us to capture the broader regional influences on house prices without adding excessive dimensions to the model.

3.5.2 Clustering of grade Variable

  • This transformation simplifies the variable into broader categories. By doing so, we aim to enhance the interpretability of the model while focusing on the significant differences in construction and design quality among houses. The model can now capture the essence of the grade variable without getting lost in its finer details.

3.6 Haversine Distance Calculation

3.6.1 Calculation of Haversine Distance

  • The prior code block illustrates the calculation of the Haversine distance. This calculation is performed with meticulous precision to incorporate the influence of location. A custom function, haversine_distance, is defined and applied to compute distances between geographical points represented by latitude and longitude. This step is critical for the model because it captures the significance of geographical proximity. It helps the model understand the spatial relationships and the impact of proximity to key locations on house prices.

3.7 Calculation of Convergence Point

3.7.1 Identification of Convergence Point

  • In the prior code block, you’ll notice the process of identifying a ‘convergence point.’ This central reference point is derived from houses with the highest values, essentially high-value homes within the dataset. This convergence point serves as a crucial reference for calculating the distance of each property from this central, high-value location. By doing so, we create a measure of proximity to desirable areas. Importantly, this calculation is meticulously handled in the code to ensure that it doesn’t introduce data leakage. It’s based solely on the training set, maintaining model integrity and avoiding the incorporation of information from the test set.

4. Build a Regression Model

4.1 Helper Functions for Model Comparison

4.1.1 add_model_performance Function

The add_model_performance function serves as a crucial tool for evaluating regression models. It takes as input a model object and conducts an evaluation using both training and testing datasets. The function calculates key performance metrics, including Sum of Squared Errors (SSE), R-squared, Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) for both the training and test datasets. These metrics provide valuable insights into the model’s predictive capabilities. Additionally, it offers the flexibility to incorporate the resulting performance metrics into an existing results dataframe or create a new one if none is provided. This function streamlines the evaluation process, making it easier to comprehensively compare and assess different models.

4.1.2 view_model_results Function

The view_model_results function simplifies the task of displaying model performance metrics in a user-friendly format. Leveraging the datatable function, it generates an interactive and visually appealing table for presenting model evaluation results. Users can input a dataframe containing model performance metrics, and this function will produce an informative table with options for adding captions. This feature enhances the communication and visualization of model performance, facilitating informed decision-making when comparing various models.

4.1.3 extract_and_display_features Function

The extract_and_display_features function emphasizes transparency and interpretability. Given a model object, it extracts and presents the features utilized by the model for making predictions. This transparency aids in understanding which variables influence the model’s decisions. By displaying the relevant features, users gain insights into the model’s feature selection process, enhancing model interpretability and assisting in feature engineering efforts.

4.1.4 create_model Function

The create_model function simplifies the creation of linear regression models based on user-defined feature subsets. Users can specify a dataframe containing their data, the target variable, and a list of features they wish to include in the model. The function constructs the model’s formula and fits a linear regression model accordingly. This flexibility empowers users to experiment with different feature combinations and observe their impact on model performance, aiding in the model selection process.

4.1.5 prepare_data Function

The prepare_data function streamlines the data preparation process for model evaluation. It extracts features used in the model, creates training and testing datasets without the target variable, and organizes the data for model evaluation. By automating these steps, this function ensures consistency and accuracy in the data used for model assessment. Users can easily interface with the prepared data, saving time and reducing the risk of errors during the evaluation process.

4.1.6 evaluate_model Function

The evaluate_model function combines the elements of model evaluation and result aggregation. Given a model, training and testing datasets, and a target variable, it evaluates the model’s performance and adds the resulting metrics to a results dataframe. Users can specify a model name for identification purposes, facilitating easy tracking of multiple model evaluations. This function streamlines the process of comparing various models by accumulating their performance metrics in a single dataframe, simplifying the decision-making process when selecting the best model for a given task.

Collectively, these functions provide a comprehensive framework for efficiently evaluating and comparing regression models. They enhance transparency, simplify the evaluation process, and empower users to make informed decisions regarding model selection and feature engineering.

#' Add Model Performance to Dataframe
#'
#' This function evaluates a given regression model using training and testing datasets. It calculates
#' performance metrics like SSE, R-squared, RMSE, and MAE for both training and test sets, and then adds
#' these metrics to a provided results dataframe or creates one if not provided. The function ensures
#' that both training and testing datasets contain the same features used in the model before performing
#' predictions and calculations.
#'
#' @param model_name A string representing the name of the model.
#' @param model The model object to be evaluated.
#' @param x_train A dataframe containing the features of the training data.
#' @param y_train A vector containing the target variable of the training data.
#' @param x_test A dataframe containing the features of the test data.
#' @param y_test A vector containing the target variable of the test data.
#' @param df_results An optional dataframe where model performance metrics will be added (default: NULL).
#' @return The updated dataframe with the added model performance metrics.
#'
#' @examples
#' linear_model <- lm(price ~ ., data = train_df)
#' df_results <- add_model_performance("Linear Model", linear_model, x_train, y_train, x_test, y_test)
#'
#' # If df_results already exists and you want to add more results:
#' df_results <- add_model_performance("Another Model", another_model, x_train, y_train, x_test, y_test, df_results)
#'
#' @details
#' The function first extracts the features used in the model and checks if these features are present in
#' both the training and testing datasets. It then uses the model to predict the target variable on both
#' datasets and calculates performance metrics. These metrics are added to a dataframe that either exists
#' or is created within the function. This dataframe can be used for comparing different models' performances.
#'
#' It is assumed that the model is correctly specified with the appropriate features and the dataframes
#' provided to the function align with the model's structure.

add_model_performance <- function(model_name, model, x_train, y_train, x_test, y_test, df_results = NULL) {
    # Create df_results if not provided
    if (is.null(df_results)) {
        df_results <- data.frame(
            Model = character(),
            SSE_train = double(),
            SSE_test = double(),
            R_squared_train = double(),
            R_squared_test = double(),
            RMSE_train = double(),
            RMSE_test = double(),
            MAE_train = double(),
            MAE_test = double(),
            stringsAsFactors = FALSE
        )
    }

    y_hat_train <- predict(model, newdata = x_train)
    y_hat_test <- predict(model, newdata = x_test)

    # Performance metrics calculation
    mae_train <- mean(abs(y_train - y_hat_train))
    mae_test <- mean(abs(y_test - y_hat_test))

    sse_train <- sum((y_train - y_hat_train)^2)
    sse_test <- sum((y_test - y_hat_test)^2)

    tss_train <- sum((y_train - mean(y_train))^2)
    tss_test <- sum((y_test - mean(y_test))^2)

    rsq_train <- 1 - (sse_train / tss_train)
    rsq_test <- 1 - (sse_test / tss_test)

    rmse_train <- sqrt(mean((y_train - y_hat_train)^2))
    rmse_test <- sqrt(mean((y_test - y_hat_test)^2))

    # Appending results to the dataframe
    new_row <- data.frame(
        Model = model_name,
        SSE_train = sse_train,
        SSE_test = sse_test,
        R_squared_train = rsq_train,
        R_squared_test = rsq_test,
        RMSE_train = rmse_train,
        RMSE_test = rmse_test,
        MAE_train = mae_train,
        MAE_test = mae_test
    )

    df_results <- rbind(df_results, new_row)

    # Returning the updated dataframe
    return(df_results)
}


#' Display Model Results using Datatable
#'
#' This function displays a dataframe containing model performance metrics using the `datatable` function.
#'
#' @param df_results A dataframe containing model performance metrics.
#' @param caption Optional caption for the table (default: "Model Comparison").
#' @return NULL (it displays the table but doesn't return a value).

view_model_results <- function(df_results, caption) {
  # Identifying numeric columns other than "Model", "R_squared_train", and "R_squared_test"
  cols_to_round <- setdiff(names(df_results[sapply(df_results, is.numeric)]), c("Model", "R_squared_train", "R_squared_test"))

  # Round these specific columns to 2 decimal places
  df_results[cols_to_round] <- lapply(df_results[cols_to_round], round, 2)

  # Round R-squared columns to 4 decimal places
  df_results$R_squared_train <- round(df_results$R_squared_train, 5)
  df_results$R_squared_test <- round(df_results$R_squared_test, 5)

  # Display the dataframe using datatable
  datatable(
    df_results,
    caption = caption,
    options = list(
      paging = FALSE,
      autoWidth = TRUE,
      scrollX = TRUE,
      fixedColumns = list(leftColumns = 1)
    )
  )
}


extract_and_display_features <- function(model, full_df, target_var) {
  features_used <- setdiff(names(coef(model)), "(Intercept)")
  features_used <- features_used[features_used != "(Intercept)"]

  # Get all the feature names from the full dataset, excluding the target variable
  all_features <- names(full_df)
  all_features <- all_features[all_features != target_var]

  # Identify features not used in the model and print if desired
  # unused_features <- setdiff(all_features, features_used)

  # Return the features that were used as a character vector
  return(features_used)
}

# Function to create a linear model based on a subset of features
create_model <- function(df, target_var, features) {
  # Construct the formula for lm()
  formula <- as.formula(paste(target_var, "~", paste(features, collapse = "+")))
  # Fit the linear model
  return(lm(formula, data = df))
}

prepare_data <- function(model, train_df, test_df, target_var) {
  # Extracting used features from the model
  used_features <- extract_and_display_features(model, train_df, target_var = target_var)

  # Create x_train and x_test without the target variable
  x_train <- subset(train_df, select = used_features)
  x_test <- subset(test_df, select = used_features)

  y_train <- train_df[[target_var]]
  y_test <- test_df[[target_var]]

  return(list(x_train = x_train, x_test = x_test, y_train = y_train, y_test = y_test))
}

evaluate_model <- function(model_name, model, train_df, test_df, target_var, df_results) {
  # Filter data into x and y subsets
  data <- prepare_data(model, train_df, test_df, target_var = target_var)

  # Adding model performance to the results dataframe
  df_results <- add_model_performance(
    model_name = model_name,
    model = model,
    x_train = data$x_train,
    y_train = data$y_train,
    x_test = data$x_test,
    y_test = data$y_test,
    df_results = df_results  # Or pass existing df_results if available
  )

  return(df_results)
}

# Create an empty coefficients data frame if it doesn't exist, otherwise add rows to it
create_coefficients_df <- function(model, model_name = "Your Model Name", coefficients_df = NULL) {
  # Check the type of model and extract coefficients accordingly
  if (inherits(model, "lm")) {
    coefficients <- coef(model)
  } else if (inherits(model, "dgCMatrix")) { # If it's a matrix from glmnet
    coefficients <- as.vector(model) # Convert matrix to vector
    names(coefficients) <- rownames(model) # Use rownames from matrix as names
  } else {
    stop("Input model is not a recognized type.")
  }

  # Prepare the model data as a dataframe row
  model_row <- data.frame(t(coefficients), check.names = FALSE)
  model_row$Model_Name <- model_name  # Add the model name as a new column

  # If coefficients_df is not provided, initialize a new data frame
  if (is.null(coefficients_df)) {
    coefficients_df <- model_row
  } else {
    # Align the model_row with the coefficients_df
    # Adding NA columns for missing features in coefficients_df
    missing_cols_in_df <- setdiff(names(model_row), names(coefficients_df))
    if (length(missing_cols_in_df) > 0) {
      coefficients_df[missing_cols_in_df] <- NA
    }

    # Adding NA columns for missing features in model_row
    missing_cols_in_row <- setdiff(names(coefficients_df), names(model_row))
    if (length(missing_cols_in_row) > 0) {
      model_row[missing_cols_in_row] <- NA
    }

    # Ensure both dataframes have the same column order
    model_row <- model_row[names(coefficients_df)]

    # Bind the new row to the existing dataframe
    coefficients_df <- rbind(coefficients_df, model_row)
  }

  # Sort the columns alphabetically, except 'Model_Name'
  cols_order <- c("Model_Name", sort(setdiff(names(coefficients_df), "Model_Name")))
  coefficients_df <- coefficients_df[, cols_order]

  return(coefficients_df)
}

4.2 Model Building

# Fit a linear regression model to the training data
linear_model_initial <- lm(price ~ ., data = train_df_linear)

# Initalize and start a coefficients_df to examine later
coefficients_df <- create_coefficients_df(linear_model_initial, "Initial OLS Model")

# Evaluate OLS_linear
df_results <- evaluate_model("OLS_linear", linear_model_initial, train_df_linear, test_df_linear, target_var = 'price', NULL)

# Add results to the df_results to view and sort later
view_model_results(df_results, caption = generate_table_caption("OLS Linear Model Table", section = 4))

4.3 Model Evaluation

# Show inital linear model results
summary(linear_model_initial)
## 
## Call:
## lm(formula = price ~ ., data = train_df_linear)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1415466   -62623     2607    59286  4245843 
## 
## Coefficients: (2 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -6.158e+07  1.366e+07  -4.509 6.56e-06 ***
## bedrooms                -2.146e+04  1.736e+03 -12.363  < 2e-16 ***
## bathrooms                2.285e+04  2.994e+03   7.634 2.42e-14 ***
## sqft_living              1.350e+02  4.033e+00  33.483  < 2e-16 ***
## sqft_lot                 3.055e-01  4.669e-02   6.542 6.25e-11 ***
## floors                  -3.145e+04  3.666e+03  -8.577  < 2e-16 ***
## sqft_above               7.569e+01  4.143e+00  18.268  < 2e-16 ***
## sqft_basement                   NA         NA      NA       NA    
## yr_built                -9.304e+00  7.239e+01  -0.129 0.897732    
## yr_renovated             3.122e+01  3.384e+00   9.224  < 2e-16 ***
## lat                     -1.572e+05  7.696e+04  -2.042 0.041156 *  
## long                     2.761e+05  6.335e+04   4.358 1.32e-05 ***
## sqft_living15            3.335e+01  3.253e+00  10.254  < 2e-16 ***
## sqft_lot15              -1.416e-01  6.931e-02  -2.043 0.041045 *  
## year_sold                5.138e+04  5.139e+03   9.999  < 2e-16 ***
## month_sold               2.958e+05  6.753e+04   4.381 1.19e-05 ***
## day_sold                 9.429e+03  2.220e+03   4.246 2.19e-05 ***
## week_of_year            -3.651e+03  3.777e+03  -0.967 0.333807    
## day_of_year             -9.050e+03  2.281e+03  -3.968 7.29e-05 ***
## zipcode_98001           -2.903e+05  2.905e+04  -9.993  < 2e-16 ***
## zipcode_98002           -2.982e+05  3.078e+04  -9.689  < 2e-16 ***
## zipcode_98003           -2.774e+05  2.919e+04  -9.504  < 2e-16 ***
## zipcode_98004            1.999e+05  2.382e+04   8.391  < 2e-16 ***
## zipcode_98005           -2.460e+05  2.613e+04  -9.412  < 2e-16 ***
## zipcode_98006           -2.606e+05  2.426e+04 -10.740  < 2e-16 ***
## zipcode_98007           -2.842e+05  2.757e+04 -10.308  < 2e-16 ***
## zipcode_98008           -2.795e+05  2.568e+04 -10.885  < 2e-16 ***
## zipcode_98010           -2.756e+05  3.688e+04  -7.471 8.37e-14 ***
## zipcode_98011           -2.987e+05  2.175e+04 -13.733  < 2e-16 ***
## zipcode_98014           -2.992e+05  3.576e+04  -8.366  < 2e-16 ***
## zipcode_98019           -3.279e+05  2.996e+04 -10.945  < 2e-16 ***
## zipcode_98022           -2.498e+05  4.049e+04  -6.170 7.00e-10 ***
## zipcode_98023           -2.687e+05  2.926e+04  -9.186  < 2e-16 ***
## zipcode_98024           -3.047e+05  3.647e+04  -8.353  < 2e-16 ***
## zipcode_98027           -3.025e+05  2.727e+04 -11.092  < 2e-16 ***
## zipcode_98028           -2.869e+05  1.928e+04 -14.883  < 2e-16 ***
## zipcode_98029           -2.495e+05  2.855e+04  -8.737  < 2e-16 ***
## zipcode_98030           -3.644e+05  2.747e+04 -13.268  < 2e-16 ***
## zipcode_98031           -3.911e+05  2.589e+04 -15.105  < 2e-16 ***
## zipcode_98032           -3.441e+05  2.853e+04 -12.062  < 2e-16 ***
## zipcode_98033           -1.448e+05  2.097e+04  -6.902 5.32e-12 ***
## zipcode_98034           -2.591e+05  1.864e+04 -13.898  < 2e-16 ***
## zipcode_98038           -3.538e+05  3.043e+04 -11.629  < 2e-16 ***
## zipcode_98039            5.373e+05  3.368e+04  15.955  < 2e-16 ***
## zipcode_98040           -1.050e+04  2.302e+04  -0.456 0.648396    
## zipcode_98042           -3.754e+05  2.818e+04 -13.324  < 2e-16 ***
## zipcode_98045           -2.583e+05  3.911e+04  -6.604 4.13e-11 ***
## zipcode_98052           -2.567e+05  2.252e+04 -11.398  < 2e-16 ***
## zipcode_98053           -2.794e+05  2.627e+04 -10.633  < 2e-16 ***
## zipcode_98055           -4.032e+05  2.388e+04 -16.884  < 2e-16 ***
## zipcode_98056           -4.069e+05  2.325e+04 -17.499  < 2e-16 ***
## zipcode_98058           -4.151e+05  2.501e+04 -16.597  < 2e-16 ***
## zipcode_98059           -4.181e+05  2.459e+04 -17.001  < 2e-16 ***
## zipcode_98065           -3.471e+05  3.369e+04 -10.303  < 2e-16 ***
## zipcode_98070           -2.809e+05  2.823e+04  -9.952  < 2e-16 ***
## zipcode_98072           -2.732e+05  2.283e+04 -11.968  < 2e-16 ***
## zipcode_98074           -3.050e+05  2.623e+04 -11.631  < 2e-16 ***
## zipcode_98075           -3.064e+05  2.769e+04 -11.065  < 2e-16 ***
## zipcode_98077           -3.164e+05  2.646e+04 -11.958  < 2e-16 ***
## zipcode_98092           -3.381e+05  3.071e+04 -11.008  < 2e-16 ***
## zipcode_98102            7.898e+04  2.148e+04   3.678 0.000236 ***
## zipcode_98103           -8.021e+04  1.369e+04  -5.858 4.79e-09 ***
## zipcode_98105            1.914e+04  1.872e+04   1.022 0.306613    
## zipcode_98106           -3.009e+05  1.713e+04 -17.564  < 2e-16 ***
## zipcode_98107           -4.959e+04  1.548e+04  -3.202 0.001366 ** 
## zipcode_98108           -3.474e+05  2.044e+04 -16.993  < 2e-16 ***
## zipcode_98109            8.410e+04  2.097e+04   4.010 6.10e-05 ***
## zipcode_98112            1.733e+05  1.911e+04   9.070  < 2e-16 ***
## zipcode_98115           -1.165e+05  1.516e+04  -7.683 1.65e-14 ***
## zipcode_98116           -1.168e+05  1.538e+04  -7.590 3.39e-14 ***
## zipcode_98117           -6.259e+04  1.336e+04  -4.684 2.84e-06 ***
## zipcode_98118           -3.281e+05  1.923e+04 -17.059  < 2e-16 ***
## zipcode_98119            8.355e+04  1.696e+04   4.927 8.42e-07 ***
## zipcode_98122           -1.560e+05  1.896e+04  -8.226  < 2e-16 ***
## zipcode_98125           -2.224e+05  1.586e+04 -14.024  < 2e-16 ***
## zipcode_98126           -2.151e+05  1.652e+04 -13.019  < 2e-16 ***
## zipcode_98133           -2.045e+05  1.527e+04 -13.389  < 2e-16 ***
## zipcode_98136           -1.430e+05  1.750e+04  -8.170 3.33e-16 ***
## zipcode_98144           -2.205e+05  1.876e+04 -11.750  < 2e-16 ***
## zipcode_98146           -3.162e+05  1.884e+04 -16.785  < 2e-16 ***
## zipcode_98148           -3.159e+05  2.921e+04 -10.817  < 2e-16 ***
## zipcode_98155           -2.349e+05  1.666e+04 -14.097  < 2e-16 ***
## zipcode_98166           -3.213e+05  2.088e+04 -15.389  < 2e-16 ***
## zipcode_98168           -3.716e+05  2.049e+04 -18.136  < 2e-16 ***
## zipcode_98177           -1.182e+05  1.784e+04  -6.625 3.59e-11 ***
## zipcode_98178           -4.312e+05  2.213e+04 -19.482  < 2e-16 ***
## zipcode_98188           -3.848e+05  2.521e+04 -15.265  < 2e-16 ***
## zipcode_98198           -3.478e+05  2.444e+04 -14.230  < 2e-16 ***
## view_1                   7.219e+04  1.037e+04   6.964 3.45e-12 ***
## view_2                   7.069e+04  6.395e+03  11.054  < 2e-16 ***
## view_3                   1.712e+05  8.963e+03  19.098  < 2e-16 ***
## view_4                   3.517e+05  1.346e+04  26.128  < 2e-16 ***
## condition_2              1.179e+05  3.612e+04   3.263 0.001104 ** 
## condition_3              1.340e+05  3.326e+04   4.029 5.63e-05 ***
## condition_4              1.583e+05  3.327e+04   4.759 1.96e-06 ***
## condition_5              2.049e+05  3.347e+04   6.122 9.50e-10 ***
## grade_Above_Average      2.119e+05  9.063e+04   2.339 0.019372 *  
## grade_Average           -1.414e+05  9.005e+04  -1.570 0.116407    
## grade_Below_Average             NA         NA      NA       NA    
## waterfront_0            -5.148e+05  1.901e+04 -27.075  < 2e-16 ***
## season_Spring            1.012e+04  5.904e+03   1.714 0.086462 .  
## season_Summer            1.257e+03  9.637e+03   0.130 0.896191    
## season_Fall             -9.601e+03  1.414e+04  -0.679 0.497116    
## distance_to_convergence -1.010e+04  7.672e+02 -13.161  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 154600 on 15027 degrees of freedom
## Multiple R-squared:  0.8186, Adjusted R-squared:  0.8174 
## F-statistic: 671.4 on 101 and 15027 DF,  p-value: < 2.2e-16

4.3.1 Overview of Model Performance

In this section, we provide an in-depth evaluation of the initial linear regression model’s performance from a data modeling perspective in the context of predicting house prices.

4.3.2 Model Fit

The model exhibits a respectable fit to the data with an adjusted R-squared value of 0.8174. This metric suggests that approximately 81.74% of the variance in house prices is accounted for by the independent variables included in the model. While this is a strong start, it’s important to explore whether there is room for model improvement.

4.3.3 Significant Predictors

Several predictors stand out as statistically significant contributors to the model’s predictive power. Notably, variables such as bedrooms, bathrooms, sqft_living, yr_renovated, lat, long, and view have coefficients with p-values less than 0.05. These variables have a substantial influence on predicting house prices and are essential components of the model.

4.3.4 Baseline and Intercept

The (Intercept) term represents the baseline house price when all other predictors are set to zero. It is crucial in understanding the inherent value of a house. Interpretation of the baseline price helps assess the model’s ability to capture the impact of other variables.

4.3.5 Challenges and Missing Coefficients

The model also reveals challenges, such as missing coefficients for some variables (e.g., sqft_basement), indicating potential data quality issues. Addressing these gaps is vital to enhance the model’s performance and interpretability.

4.3.6 Residual Analysis

Residual analysis indicates that the model’s residuals range from -1,415,466 to 4,245,843, suggesting the presence of heteroscedasticity or outliers. Deeper investigation into these issues is necessary to refine the model and ensure its robustness.

4.3.7 Feature Engineering

The model incorporates a wide range of features, including property characteristics, location-based attributes, and temporal variables. These features collectively contribute to its predictive capacity. Ongoing feature engineering efforts should focus on selecting relevant variables and transforming them effectively to improve the model’s accuracy.

4.3.8 Model Optimization

While the initial model provides valuable insights, further optimization is warranted. Exploring alternative regression techniques, addressing multicollinearity among predictors, and employing feature selection methods can help enhance model performance.

4.3.9 Next Steps

We will commence with a comprehensive exploration of our dataset in Section 5, utilizing various visualization techniques to gain insights into the relationships between variables. This exploration includes scatter plots (Section 5.1), a correlation matrix (Section 5.2), and in-depth relationship analysis (Section 5.3).

In Section 6, we will employ a stepwise model selection approach to identify the most relevant predictors for our linear regression model. This process will be detailed in Section 6.1. Subsequently, we will present the results for the best linear model iterations in Sections 6.2 and 6.3, followed by a model comparison in Section 6.4.

Ensuring the validity of linear regression assumptions is vital, and we will conduct thorough checks in Section 7. This includes assessing the linearity assumption (Section 7.1), examining the normality of residuals (Section 7.2), and verifying homoscedasticity (constant variance) (Section 7.3).

In Section 8, we will address issues related to linearity, normality, heteroscedasticity and multicollinearity. We will detect and analyze heteroscedasticity in Section 8, with the presentation of remedial measures.

As we progress, we will explore alternative modeling techniques in Section 9. This includes the implementation of a regression tree model (Section 9.1), a neural network model (Section 9.2), and a support vector machine (SVM) model (Section 9.3). If applicable, we will also consider a logistic regression model (Section 9.4). These alternative models will provide valuable insights and potential enhancements to our predictive capabilities.


5. Data Exploration and Visualization

5.1 Continuous Variable Plots

5.1.1 Price vs. Square Footage of Living Space

# Scatter plot of Price vs. Square Footage of Living Space
ggplot(data = train_df_non_linear, aes(x = sqft_living, y = price)) +
  geom_point(pch = 20, col = "blue") +
  labs(title = "Price vs. Square Footage of Living Space",
       subtitle = "Seattle Housing Data",
       x = "Sqft Living Space",
       y = "Price",
       caption = generate_figure_caption("Price vs. Square Footage of Living Space", section = 5))

In the scatter plot above, we compare the price of homes against their sqft_living (square footage of interior living space). This visualization allows us to explore the relationship between these two variables.

# Distribution of Square Footage of Living Space
ggplot(data = train_df_non_linear, aes(x = sqft_living)) +
  geom_histogram(bins = 50) +
  labs(title = "Distribution of Sqft Living Space",
       x = "Sqft Living Space",
       y = "Density",
       caption = generate_figure_caption("Distribution of Sqft Living Space", section = 5))

The histogram above displays the distribution of sqft_living. It reveals that the variable is right-skewed, with most homes having smaller living spaces and relatively fewer very large living spaces.

5.1.2 Price vs. Square Footage of Lot

# Scatter plot of Price vs. Square Footage of Lot
ggplot(data = train_df_non_linear, aes(x = sqft_lot, y = price)) +
  geom_point(pch = 20, col = "green") +
  labs(title = "Price vs. Sqft Lot",
       x = "Sqft Lot",
       y = "Price",
       caption = generate_figure_caption("Price vs. Sqft Lot", section = 5))

The scatter plot above compares price against sqft_lot (square footage of land space). It helps us understand if there’s any relationship between the size of the lot and the sale price.

# Distribution of Square Footage of Lot
ggplot(data = train_df_non_linear, aes(x = sqft_lot)) +
  geom_histogram(bins = 50) +
  labs(title = "Distribution of Sqft Lot",
       x = "Sqft Lot",
       y = "Density",
       caption = generate_figure_caption("Distribution of Sqft Lot", section = 5))

The histogram above visualizes the distribution of sqft_lot. Similar to sqft_living, this variable is right-skewed, with most homes having smaller lot sizes and relatively fewer very large lots.

5.1.3 Price vs. Square Footage Above Ground

# Scatter plot of Price vs. Square Footage Above Ground
ggplot(data = train_df_non_linear, aes(x = sqft_above, y = price)) +
  geom_point(pch = 20, col = "red") +
  labs(title = "Price vs. Sqft Above Ground",
       x = "Sqft Above Ground",
       y = "Price",
       caption = generate_figure_caption("Price vs. Sqft Above Ground", section = 5))

In the scatter plot above, we compare price against sqft_above (square footage of the interior housing space above ground level). This analysis helps us explore the impact of above-ground living space on home prices.

# Distribution of Square Footage Above Ground
ggplot(data = train_df_non_linear, aes(x = sqft_above)) +
  geom_histogram(bins = 50) +
  labs(title = "Distribution of Sqft Above Ground",
       x = "Sqft Above Ground",
       y = "Density",
       caption = generate_figure_caption("Distribution of Sqft Above Ground", section = 5))

The histogram above shows the distribution of sqft_above. It suggests that most homes have similar above-ground square footage, with relatively fewer having significantly larger or smaller above-ground spaces.

5.1.4 Price vs. Square Footage of Basement

Excluding homes that do not have a basement.

# Scatter plot of Price vs. Square Footage of Basement (excluding 0 sqft basement)
# Filter data for non-zero sqft_basement
filtered_data <- train_df_non_linear[train_df_non_linear$sqft_basement > 0,]

# Scatter plot with custom caption
ggplot(data = filtered_data, aes(x = sqft_basement, y = price)) +
  geom_point(pch = 20, col = "purple") +
  labs(title = "Price vs. Sqft Basement",
       x = "Sqft Basement",
       y = "Price",
       caption = generate_figure_caption("Price vs. Sqft Basement (Non-Zero Values)", section = 5))

The scatter plot above compares price against sqft_basement (square footage of the interior housing space below ground level). This visualization helps us understand if the presence and size of a basement influence home prices.

# Distribution of Square Footage of Basement (excluding 0 values)
# Histogram with custom caption
ggplot(data = filtered_data, aes(x = sqft_basement)) +
  geom_histogram(bins = 50) +
  labs(title = "Distribution of Sqft Basement",
       x = "Sqft Basement",
       y = "Density",
       caption = generate_figure_caption("Distribution of Sqft Basement (Non-Zero Values)", section = 5))

The histogram above visualizes the distribution of sqft_basement. It indicates that most homes have little to no basement space, while some have larger basement areas.

5.1.5 Price vs. Year Built

# Scatter plot of Price vs. Year Built
ggplot(data = train_df_non_linear, aes(x = yr_built, y = price)) +
  geom_point(pch = 20, col = "orange") +
  labs(title = "Price vs. Year Built",
       x = "Year Built",
       y = "Price",
       caption = generate_figure_caption("Price vs. Year Built", section = 5))

The scatter plot above compares price against the year when homes were initially built (yr_built). This analysis helps us understand how the age of a home relates to its sale price.

# Distribution of Year Built
ggplot(data = train_df_non_linear, aes(x = yr_built)) +
  geom_histogram(bins = 50) +
  labs(title = "Distribution of Year Built",
       x = "Year Built",
       y = "Density",
       caption = generate_figure_caption("Distribution of Year Built", section = 5))

The histogram above displays the distribution of yr_built. It provides insights into the distribution of home ages in the dataset.

5.1.6 Price vs. Year of Last Renovation

Excluding homes that did not have a documented renovation.

# Find lowest non-zero year renovated
lowest_non_zero_renovation_year <- min(train_df_non_linear$yr_renovated[train_df_non_linear$yr_renovated > 0]) - 1

# Filter data for non-zero yr_renovated
filtered_data <- train_df_non_linear[train_df_non_linear$yr_renovated > 0,]

# Scatter plot of Price vs. Year Renovated
lowest_non_zero_renovation_year <- min(train_df_non_linear$yr_renovated[train_df_non_linear$yr_renovated > 0]) - 1
ggplot(data = filtered_data, aes(x = yr_renovated, y = price)) +
  geom_point(pch = 20, col = "brown") +
  labs(title = "Price vs. Year Renovated",
       x = "Year Renovated",
       y = "Price",
       caption = generate_figure_caption("Price vs. Year Renovated (Non-Zero Values)", section = 5)) +
  xlim(c(lowest_non_zero_renovation_year, max(train_df_non_linear$yr_renovated)))

In the scatter plot above, we compare price against the year of the last renovation (yr_renovated). This analysis helps us understand whether recent renovations impact home prices.

# Find lowest non-zero year renovated
lowest_non_zero_renovation_year <- min(train_df_non_linear$yr_renovated[train_df_non_linear$yr_renovated > 0]) - 1

# Filter data for non-zero yr_renovated
filtered_data <- train_df_non_linear[train_df_non_linear$yr_renovated > 0,]

# Histogram of Year Renovated
ggplot(data = filtered_data, aes(x = yr_renovated)) +
  geom_histogram(fill = "orange") +
  labs(title = "Histogram of Year Renovated",
       x = "Year Renovated",
       y = "Density",
       caption = generate_figure_caption("Histogram of Year Renovated (Non-Zero Values)", section = 5)) +
  xlim(c(lowest_non_zero_renovation_year, max(train_df_non_linear$yr_renovated)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram above visualizes the distribution of yr_renovated. It provides insights into the distribution of renovation years in the dataset.

5.1.7 Price vs. Distance to Convergence

# Scatter plot of Price vs. Distance to Convergence
ggplot(data = train_df_non_linear, aes(x = distance_to_convergence, y = price)) +
  geom_point(pch = 20, col = "violet") +
  labs(title = "Price vs. Distance to Convergence",
       x = "Distance to Convergence",
       y = "Price",
       caption = generate_figure_caption("Price vs. Distance to Convergence", section = 5))

The scatter plot above compares price against distance_to_convergence. This analysis helps us explore whether the distance to a convergence point impacts home prices.

# Distribution of Distance to Convergence
ggplot(data = train_df_non_linear, aes(x = distance_to_convergence)) +
  geom_histogram(bins = 50) +
  labs(title = "Distribution of Distance to Convergence",
       x = "Distance to Convergence",
       y = "Density",
       caption = generate_figure_caption("Distribution of Distance to Convergence", section = 5))

5.2 Categorical Variable Analysis

The distribution and count of categorical variables such as bedrooms, bathrooms, floors, waterfront, view, condition, and grade are analyzed.

5.2.1 Price vs. Bedrooms

# Convert bedrooms to factor
train_df_non_linear$bedrooms_factor <- factor(train_df_non_linear$bedrooms)

# Binned Boxplot of Price vs. Bedrooms
ggplot(data = train_df_non_linear, aes(x = bedrooms_factor, y = price)) +
  geom_boxplot(fill = "blue") +
  labs(title = "Price vs. Bedrooms",
       x = "Bedrooms",
       y = "Price",
       caption = generate_figure_caption("Price vs. Bedrooms", section = 5))

The scatter plot above compares price against the number of bedrooms. This visualization helps us understand how the number of bedrooms influences home prices.

# Filter data excluding 33 bedrooms
filtered_bedrooms <- train_df_non_linear$bedrooms[train_df_non_linear$bedrooms != 33]

# Calculate frequencies of each bedroom count
bedroom_frequencies <- table(filtered_bedrooms)

ggplot(data = data.frame(filtered_bedrooms = as.factor(names(bedroom_frequencies)),
                        filtered_counts = as.numeric(bedroom_frequencies)),
       aes(x = filtered_bedrooms, y = filtered_counts)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Distribution of Bedrooms (Excluding 33 Bedrooms)",
       x = "Number of Bedrooms",
       y = "Frequency",
       caption = generate_figure_caption("Distribution of Bedrooms (Excluding 33 Bedrooms)", section = 5))

The bar plot above displays the distribution of the bedrooms variable, showing the frequency of each bedroom count.

5.2.2 Price vs. Bathrooms

# Convert bathrooms to factor
train_df_non_linear$bathrooms_factor <- factor(train_df_non_linear$bathrooms)

# Binned Boxplot of Price vs. Bathrooms
ggplot(data = train_df_non_linear, aes(x = bathrooms_factor, y = price)) +
  geom_boxplot(fill = "green") +
  labs(title = "Price vs. Bathrooms",
       x = "Bathrooms",
       y = "Price",
       caption = generate_figure_caption("Price vs. Bathrooms", section = 5))

In the scatter plot above, we compare price against the number of bathrooms. This analysis helps us explore the relationship between the number of bathrooms and home prices.

# Get data for bar plot
bathrooms_counts <- table(train_df_non_linear$bathrooms)
bathrooms <- as.numeric(names(bathrooms_counts))
counts <- as.numeric(bathrooms_counts)

# Bar plot for the distribution of Bathrooms
ggplot(data = data.frame(bathrooms, counts), aes(x = bathrooms, y = counts)) +
  geom_bar(stat = "identity", fill = "green") +
  labs(title = "Distribution of Bathrooms",
       x = "Number of Bathrooms",
       y = "Frequency",
       caption = generate_figure_caption("Distribution of Bathrooms", section = 5))

The bar plot above visualizes the distribution of the bathrooms variable, showing the frequency of each bathroom count.

5.2.3 Price vs. Floors

# Binned Boxplot of Price vs. Floors
ggplot(data = train_df_non_linear, aes(x = floors, y = price, group = floors)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Price vs. Floors",
       x = "Floors",
       y = "Price",
       caption = generate_figure_caption("Price vs. Floors", section = 5))

The scatter plot above compares price against the number of floors. This analysis helps us understand how the number of floors in a home relates to its sale price.

floors_counts <- table(train_df_non_linear$floors)
floors <- as.numeric(names(floors_counts))
counts <- as.numeric(floors_counts)

# Bar plot for the distribution of Floors
ggplot(data = data.frame(floors, counts), aes(x = floors, y = counts)) +
  geom_bar(stat = "identity", fill = "orange") +
  labs(title = "Distribution of Floors",
       x = "Number of Floors",
       y = "Frequency",
       caption = generate_figure_caption("Distribution of Floors", section = 5))

The bar plot above displays the distribution of the floors variable, showing the frequency of each floor count.

5.2.4 Price vs. Waterfront

ggplot(data = train_df_non_linear, aes(x = waterfront_1, y = price, group = waterfront_1)) +
  geom_boxplot(fill = "purple") +
  labs(title = "Price vs. Waterfront",
       x = "Waterfront",
       y = "Price",
       caption = generate_figure_caption("Price vs. Waterfront", section = 5),
       fill = "Waterfront",
       levels = c("No", "Yes"))  # Labels for waterfront status

In the scatter plot above, we compare price against the waterfront variable. This visualization helps us explore how having a waterfront view impacts home prices.

# Get data for bar plot
waterfront_counts <- table(train_df_non_linear$waterfront_1)
waterfront <- as.numeric(names(waterfront_counts))
counts <- as.numeric(waterfront_counts)

# Bar plot for the distribution of Waterfront
ggplot(data = data.frame(waterfront, counts), aes(x = waterfront, y = counts)) +
  geom_bar(stat = "identity", fill = "purple") +
  labs(title = "Distribution of Waterfront",
       x = "Waterfront",
       y = "Frequency",
       caption = generate_figure_caption("Distribution of Waterfront", section = 5),
       fill = "Waterfront",
       levels = c("No", "Yes"))  # Labels for waterfront status

The bar plot above visualizes the distribution of the waterfront variable, showing the frequency of waterfront and non-waterfront properties.

5.2.5 Price vs. View

# Convert view categories from dummy variables to a factor for better labeling in ggplot
train_df_non_linear$view_category <- factor(apply(train_df_non_linear[, c("view_0", "view_1", "view_2", "view_3", "view_4")], 1, function(x) which(x == 1)),
                                             labels = c("View 0", "View 1", "View 2", "View 3", "View 4"))

# Create the boxplot with ggplot2
ggplot(train_df_non_linear, aes(x = view_category, y = price)) +
  geom_boxplot(fill = "brown") +
  labs(title = "Price vs. View Quality",
       x = "View Quality",
       y = "Price",
       caption = generate_figure_caption("Boxplot of Price vs. View Quality", section = 5))

The scatter plot above compares price against the view variable, which represents the quality of the property’s view. This analysis helps us explore how the view quality impacts home prices.

# Calculate frequencies of each view quality rating
view_frequencies <- colSums(train_df_non_linear[, c("view_0", "view_1", "view_2", "view_3", "view_4")])

# Convert frequencies to data frame for ggplot2
view_df <- data.frame(View = names(view_frequencies), Frequency = view_frequencies)

# Create the bar plot with ggplot2
ggplot(view_df, aes(x = View, y = Frequency)) +
  geom_bar(stat = "identity", fill = "purple") +
  labs(title = "Distribution of View Quality",
       x = "View Quality",
       y = "Frequency",
       caption = generate_figure_caption("Distribution of View Quality", section = 5))

The bar plot above displays the distribution of the view variable, showing the frequency of different view quality ratings.

5.2.6 Price vs. Condition

# Convert condition categories from dummy variables to a factor
train_df_non_linear$condition_category <- factor(apply(train_df_non_linear[, c("condition_1", "condition_2", "condition_3", "condition_4", "condition_5")], 1, function(x) which(x == 1)),
                                                 labels = c("Condition 1", "Condition 2", "Condition 3", "Condition 4", "Condition 5"))

# Create the boxplot with ggplot2
ggplot(train_df_non_linear, aes(x = condition_category, y = price)) +
  geom_boxplot(fill = "blue") +
  labs(title = "Price vs. Condition",
       x = "Condition",
       y = "Price",
       caption = generate_figure_caption("Boxplot of Price vs. Condition", section = 5))

In the scatter plot above, we compare price against the condition variable, which represents the condition of the property. This analysis helps us explore how property condition relates to home prices.

# Calculate frequencies of each condition rating
condition_frequencies <- colSums(train_df_non_linear[, c("condition_1", "condition_2", "condition_3", "condition_4", "condition_5")])

# Convert frequencies to a data frame for ggplot2
condition_df <- data.frame(Condition = names(condition_frequencies), Frequency = condition_frequencies)

# Create the bar plot with ggplot2
ggplot(condition_df, aes(x = Condition, y = Frequency)) +
  geom_bar(stat = "identity", fill = "green") +
  labs(title = "Distribution of Condition",
       x = "Condition Rating",
       y = "Frequency",
       caption = generate_figure_caption("Distribution of Condition", section = 5))

The bar plot above visualizes the distribution of the condition variable, showing the frequency of different condition ratings.

5.2.7 Price vs. Grade

# First, identify all grade-related columns in the dataframe
grade_columns <- grep("grade_", names(train_df_non_linear), value = TRUE)

# Convert dummy variables back to a single categorical variable representing the grade
train_df_non_linear$grade_category <- apply(train_df_non_linear[, grade_columns], 1, function(row) {
  if (all(is.na(row))) {
    return(NA)  # Return NA if all values in the row are NA
  } else {
    idx <- which(row == 1, arr.ind = TRUE)
    return(if(length(idx) > 0) idx else NA)  # Return the index of the grade, or NA if none is 1
  }
})

# Extract grade labels from column names, replacing underscores with hyphens for better readability
grade_labels <- sub("grade_", "", grade_columns) # Remove 'grade_' prefix
grade_labels <- gsub("_", "-", grade_labels) # Replace underscores with hyphens

# Create a boxplot of Price vs. Grade
ggplot(train_df_non_linear, aes(x = factor(grade_category, labels = grade_labels), y = price)) +
  geom_boxplot(fill = "green") +
  labs(title = "Price vs. Grade",
       x = "Grade",
       y = "Price",
       caption = generate_figure_caption("Boxplot of Price vs. Grade", section = 5))

The scatter plot above compares price against the grade variable, which has been aggregated into categories as per the provided header. This analysis helps us explore how the grade of construction and design impacts home prices.

# Histogram for the Distribution of Grade
# Convert the grade category to a numeric variable for histogram plotting
train_df_non_linear$grade_category_numeric <- as.numeric(train_df_non_linear$grade_category)

# Define breaks for histogram
num_breaks <- length(unique(train_df_non_linear$grade_category_numeric, na.rm = TRUE))
hist_breaks <- seq(min(train_df_non_linear$grade_category_numeric, na.rm = TRUE) - 0.5,
                   max(train_df_non_linear$grade_category_numeric, na.rm = TRUE) + 0.5,
                   length.out = num_breaks + 1)

# Create a histogram with ggplot2
ggplot(train_df_non_linear, aes(x = grade_category_numeric)) +
  geom_histogram(fill = "purple", breaks = hist_breaks) +
  scale_x_continuous(breaks = seq_along(grade_labels), labels = grade_labels) +
  labs(title = "Distribution of Grade",
       x = "Grade",
       y = "Frequency",
       caption = generate_figure_caption("Histogram of Distribution of Grade", section = 5))

The bar plot above displays the distribution of the grade_category variable, showing the frequency of different grade categories.

5.3 Correlation Analysis

Understanding how continuous variables correlate with each other and, more importantly, with the target variable price.

5.3.1 Correlation Matrix

# Correlation Matrix of Numeric Variables
cor_matrix <- cor(train_df_non_linear[sapply(train_df_non_linear, is.numeric)])
# Create a table of sorted correlation values
cor_table <- as.data.frame(sort(cor_matrix[,"price"], decreasing = TRUE))

# Display the top 20 correlation values
top_20_corr <- cor_table[1:20, , drop = FALSE]
Table 5.1 Top 20 Correlation Values with Price
Variable Correlation with Price
price 1.0000000
sqft_living 0.7020794
sqft_above 0.5998325
sqft_living15 0.5931358
bathrooms 0.5202595
grade_Above_Average 0.4714246
sqft_basement 0.3307377
lat 0.3122509
bedrooms 0.3051678
view_4 0.2997715
zipcode_98004 0.2748744
floors 0.2569215
waterfront_1 0.2455379
zipcode_98040 0.2082191
view_3 0.1908930
zipcode_98112 0.1810434
zipcode_98039 0.1711855
view_2 0.1472079
zipcode_98006 0.1363141
yr_renovated 0.1276562

5.3.2 Correlation Graphics Analysis

In the tables presented above, we’ve showcased the top 20 correlation values concerning the target variable, price, with the values sorted by their absolute magnitudes. Here are some crucial observations from this analysis:

  1. Positive Correlations with Price:
    • Variables like sqft_living, sqft_above, sqft_living15, and bathrooms exhibit robust positive correlations with the target variable (price). This implies that as these features increase, house prices tend to increase correspondingly.
    • Features such as grade_11_13, view_4, and grade_8_10 also demonstrate positive correlations, indicating that properties with higher grades and better views tend to command higher prices.
  2. Negative Correlations with Price:
    • There are no negative correlations among the top 20 correlated variables. This suggests that none of the examined features strongly suggest a decrease in house price as they increase.
  3. Feature Importance:
    • The strength of these correlations provides insights into the importance of variables in predicting house prices. Variables like sqft_living and grade_11_13 emerge as strong predictors of price.
    • Location-related variables, such as zipcode_98004, zipcode_98039, and zipcode_98040, also exhibit noteworthy positive correlations, underscoring the significance of location in price determination.

5.3.3 Correlation Heatmap

# Heatmap of the top 20 correlation values
# Filter the top 20 correlation values
top_20_corr_variables <- rownames(top_20_corr)
top_20_corr_matrix <- cor_matrix[top_20_corr_variables, top_20_corr_variables]

# Create a heatmap
ggplot(melt(top_20_corr_matrix), aes(Var1, Var2, fill = value)) +
  geom_tile() +
  labs(title = "Top 20 Correlations",
       x = "Variable",
       y = "Variable",
       caption = generate_figure_caption("Heatmap showing the top 20 correlations", section = 5)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

5.3.4 Correlation Matrix for Multicollinearity

# Selecting predictors and excluding the response variable 'price'
predictors <- dplyr::select(train_df_linear, -price)

# Convert factors to numeric
numeric_predictors <- predictors %>%
  mutate(across(where(is.factor), as.numeric)) %>%
  mutate(across(where(is.character), ~as.numeric(as.factor(.))))

# Calculate the correlation matrix
corr_matrix <- cor(numeric_predictors, use = "pairwise.complete.obs")

# Convert the correlation matrix to a long format
correlated_pairs_df <- melt(corr_matrix)

# Filter out redundant pairs (keep only lower triangle of the matrix)
correlated_pairs_df <- correlated_pairs_df %>%
  filter(Var1 != Var2) %>%  # Remove self-correlations
  filter(abs(value) > 0.8) %>%
  filter(match(Var1, rownames(corr_matrix)) < match(Var2, rownames(corr_matrix)))

# Rename columns for clarity
correlated_pairs_df <- correlated_pairs_df %>%
  rename(Variable1 = Var1, Variable2 = Var2, Correlation = value)

# Output the table of highly correlated pairs using knitr::kable()
knitr_table <- kable(
  correlated_pairs_df,
  caption = generate_table_caption("Highly Correlated Variable Pairs", section = 5),
  format = "markdown"
)
print(knitr_table)
Table 5.2 Highly Correlated Variable Pairs
Variable1 Variable2 Correlation
sqft_living sqft_above 0.8744114
month_sold week_of_year 0.9955447
month_sold day_of_year 0.9958281
week_of_year day_of_year 0.9996951
condition_3 condition_4 -0.8095157
grade_Above_Average grade_Average -0.9954905

5.3.5 Detailed Explanation for Removal

5.3.5.1 sqft_above & sqft_living

The removal of sqft_above and sqft_living is justified due to their high correlation coefficient of 0.8744114. sqft_above represents the square footage of the living area above ground, while sqft_living encompasses the total square footage of living space. Since sqft_above is a subset of sqft_living, it is likely to contain redundant information, making it less valuable for our model.

5.3.5.2 month_sold & week_of_year

The variables month_sold and week_of_year exhibit a remarkably high correlation coefficient of 0.9955447. These variables are intrinsically correlated as they both pertain to the date of the house sale. While day_of_year provides the most detailed temporal information, retaining both week_of_year and month_sold may lead to multicollinearity issues. It’s advisable to consider removing one of these variables to mitigate multicollinearity while preserving the most granular date-related information.

5.3.5.3 month_sold & day_of_year

Similar to the previous case, month_sold and day_of_year demonstrate a high correlation coefficient of 0.9958281. Both variables are related to the date of the house sale. Given that day_of_year provides the most granular temporal information, it may be preferred to retain it while considering the removal of month_sold to address multicollinearity concerns.

5.3.5.4 week_of_year & day_of_year

The correlation coefficient of 0.9996951 between week_of_year and day_of_year indicates an extremely high correlation. Both variables are associated with the date of sale. Given the granularity of day_of_year, retaining it and potentially removing week_of_year can be a strategy to reduce multicollinearity while retaining essential date-related information.

5.3.5.5 condition_4 & condition_3

The variables condition_4 and condition_3 display a notable negative correlation coefficient of -0.8095157. These variables are derived from the categorical variable indicating the condition of the house. Through one-hot encoding, binary variables were created for each condition. Since these conditions are mutually exclusive, they exhibit a negative correlation. Consideration can be given to keeping one condition as a reference group and discarding the other, or reverting to using the original categorical variable to effectively capture overall house condition.

5.3.5.6 grade_Above_Average & grade_Average

The correlation coefficient of -0.9954905 between grade_Above_Average and grade_Average highlights a strong negative correlation. These variables represent different grade categories of houses. Such a high correlation suggests that retaining both variables may introduce multicollinearity into the model. Decisions can be made to keep one of these variables as a representative of house grade or explore alternative encoding strategies.

By addressing the removal of these highly correlated variable pairs, our primary goal is to mitigate multicollinearity issues. Multicollinearity can distort regression coefficient estimates, inflate standard errors, and potentially obscure the statistical significance of predictors. The objective is to retain variables that provide unique and informative contributions to the model’s prediction of house prices.

5.6 Geographical Influence Analysis

Investigating the spatial aspect by analyzing the distance_to_convergence variable.

5.6.1 Distance to Convergence Point Map

# Calculate z-scores for the prices
train_df_non_linear <- train_df_non_linear %>%
  mutate(z_score = scale(price))

# Define z-score intervals and corresponding colors
z_score_intervals <- seq(-3, 3, by = 1)  # Create a sequence of z-scores from -3 to 3
color_sequence <- c("green", "#8fd744", "#fde725", "#f76818ff", "#d7301fff", "#440154")  # From green to dark color

# Calculate price at each z-score interval
price_at_intervals <- sapply(z_score_intervals, function(z) {
  mean(train_df_non_linear$price) + z * sd(train_df_non_linear$price)
})

# Ensure breaks are in ascending order and rounded to the nearest 25k
breaks <- sort(round(price_at_intervals / 25000) * 25000)
breaks <- c(min(train_df_non_linear$price, na.rm = TRUE), breaks, max(train_df_non_linear$price, na.rm = TRUE))

# If there are negative values or values that don't make sense, remove them
breaks <- breaks[breaks >= 0]

# Create color palette with a color for each interval
color_palette <- colorBin(color_sequence, domain = train_df_non_linear$price, bins = breaks, na.color = "#808080")

# Initialize the leaflet map with updated color palette
m <- leaflet(train_df_non_linear) %>%
  addTiles() %>%
  addCircleMarkers(
    lat = ~lat, lng = ~long,
    color = ~color_palette(price),
    fillColor = ~color_palette(price),
    fillOpacity = 0.8,
    radius = 1,  # Small dots
    popup = ~paste("Price: $", formatC(price, format = "f", big.mark = ","), "<br>", "Z-Score: ", round(z_score, 2))
  )

# Define the maximum distance for the distance bands
max_distance <- max(train_df_non_linear$distance_to_convergence, na.rm = TRUE)

# Add distance bands to the map
for (i in seq(2, max_distance, by = 2)) {
  m <- addCircles(m, lat = convergence_point[1], lng = convergence_point[2], radius = i * 1000,
                  color = "grey", weight = 1, fill = FALSE, dashArray = "5, 5")
}

# Add legend and finalize the map
m <- m %>%
  addLegend(
    position = "bottomright",
    pal = color_palette,
    values = ~price,
    title = "Price",
    labFormat = labelFormat(prefix = "$"),
    opacity = 1
  ) %>%
  setView(lng = convergence_point[2], lat = convergence_point[1], zoom = 10)

cat(generate_figure_caption('Distance to Convergence Map', section = 5))

Figure 5.38 Distance to Convergence Map

5.6.2 Conclusion

This detailed review of the King County house sales dataset underscores the thorough preparation undertaken for the predictive analysis. The dataset’s diverse variables, both continuous and categorical, have been meticulously processed and analyzed, providing a robust foundation for developing the predictive model. With the comprehensive EDA and graphical analysis, we gain valuable insights into the correlations and distributions within the data, setting the stage for effective model building and accurate house price prediction.

5.7 Removal of Plot Features, Correlation, Multicollinearity and NA Values

# Drop columns created for visualizations in prior steps, columns that have high correlation, multicollinearity or NA values in the model
train_df_non_linear <- train_df_non_linear[, !colnames(train_df_non_linear) %in% c("view_category", "condition_category", "grade_category", "grade_category_numeric", "z_score", "lat", "long", 'sqft_above', 'month_sold', 'week_of_year', 'condition_3', "grade_Below_Average", "bedrooms_factor", "bathrooms_factor")]
train_df_linear <- train_df_linear[, !colnames(train_df_linear) %in% c("view_category", "condition_category", "grade_category", "grade_category_numeric", "z_score", "lat", "long", 'sqft_above', 'month_sold', 'week_of_year', 'condition_3', "grade_Below_Average", "bedrooms_factor", "bathrooms_factor")]
test_df_linear <- test_df_linear[, !colnames(test_df_linear) %in% c("view_category", "condition_category", "grade_category", "grade_category_numeric", "z_score", "lat", "long", 'sqft_above', 'month_sold', 'week_of_year', 'condition_3', "grade_Below_Average", "bedrooms_factor", "bathrooms_factor")]
test_df_non_linear <- test_df_non_linear[, !colnames(test_df_non_linear) %in% c("view_category", "condition_category", "grade_category", "grade_category_numeric", "z_score", "lat", "long", 'sqft_above', 'month_sold', 'week_of_year', 'condition_3', "grade_Below_Average", "bedrooms_factor", "bathrooms_factor")]

# Rebuild linear regression model before performing stepwise
linear_model_initial <- lm(price ~ ., data = train_df_linear)

# Add new coefficients to dataframe
coefficients_df <- create_coefficients_df(
  model = linear_model_initial,
  model_name = "OLS w/o corr",
  coefficients_df = coefficients_df
)

6. Stepwise Model Selection

6.1 Stepwise Methodology

# Define a function to fetch dropped features from models and create a kable table
get_dropped_features_table <- function(model_backward, model_forward, best_linear_model, train_df_linear) {
  # Create a data frame to store the results
  results_df <- data.frame(Stepwise_Method = character(), Dropped_Features = character(), stringsAsFactors = FALSE)

  # All possible features (excluding the response variable 'price')
  all_possible_features <- setdiff(colnames(train_df_linear), "price")

  # Function to find the dropped features for a given model
  get_dropped_features <- function(model, all_features) {
    included_features <- names(coef(model))
    dropped_features <- setdiff(all_features, included_features)
    return(paste(dropped_features, collapse = ", "))
  }

  # Get the dropped features for each model
  dropped_backward <- get_dropped_features(model_backward, all_possible_features)
  dropped_forward <- get_dropped_features(model_forward, all_possible_features)
  dropped_both <- get_dropped_features(best_linear_model, all_possible_features)

  # Add results to the data frame
  results_df <- rbind(results_df, data.frame(Stepwise_Method = "OLS_Step_Backward", Dropped_Features = dropped_backward))
  results_df <- rbind(results_df, data.frame(Stepwise_Method = "OLS_Step_Forward", Dropped_Features = dropped_forward))
  results_df <- rbind(results_df, data.frame(Stepwise_Method = "OLS_Step_Both", Dropped_Features = dropped_both))

  # Create the kable table
  table <- kable(
    results_df,
    format = "html",
    caption = generate_table_caption("Dropped Features", section = 6),
    col.names = c("Stepwise Method", "Dropped Features")
  ) %>%
    kable_styling(full_width = TRUE)

  return(table)
}

# Conditional logic based on the update_model_parameters flag
if (update_model_parameters) {
  # Run each of the stepwise regression models and update the JSON file
  # Step model both
  best_linear_model <- ols_step_both_p(linear_model_initial, pent=0.35, prem=0.05)
  features_both <- setdiff(names(coef(best_linear_model$model)), "(Intercept)")

  # Create a formula with the selected features
  formula <- reformulate(features_forward, response = "price")

  # Refit the model with the selected features
  best_linear_model <- lm(formula, data = train_df_linear)

  # Update JSON
  update_model_json("OLS_Step_Both", features_both, json_filepath)

  # Step model Backward
  model_backward <- ols_step_backward_p(linear_model_initial, pent=0.35, prem = 0.05)
  features_backward <- setdiff(names(coef(model_backward$model)), "(Intercept)")

  # Create a formula with the selected features
  formula <- reformulate(features_backward, response = "price")

  # Refit the model with the selected features
  model_backward <- lm(formula, data = train_df_linear)

  # Update JSON
  update_model_json("OLS_Step_Backward", features_backward, json_filepath)

  # Step model foward
  model_forward <- ols_step_forward_p(linear_model_initial, pent=0.35, penter = 0.05)
  features_forward <- setdiff(names(coef(model_forward$model)), "(Intercept)")

  # Create a formula with the selected features
  formula <- reformulate(features_forward, response = "price")

  # Refit the model with the selected features
  model_forward <- lm(formula, data = train_df_linear)

  # Update JSON
  update_model_json("OLS_Step_Forward", features_forward, json_filepath)

  # Load model parameters from JSON and build models
  model_params <- fromJSON(json_filepath)
  # List of features for each model from model_params
  features_both <- model_params$OLS_Step_Both
  features_backward <- model_params$OLS_Step_Backward
  features_forward <- model_params$OLS_Step_Forward

  # Create a temporary dataframe for each model with the selected features
  train_df_both <- train_df_linear[, c("price", features_both)]
  test_df_both <- test_df_linear[, c("price", features_both)]

  train_df_backward <- train_df_linear[, c("price", features_backward)]
  test_df_backward <- test_df_linear[, c("price", features_backward)]

  train_df_forward <- train_df_linear[, c("price", features_forward)]
  test_df_forward <- test_df_linear[, c("price", features_forward)]

  # Fit the linear models using the temporary dataframes
  best_linear_model <- lm(price ~ ., data = train_df_both)
  model_backward <- lm(price ~ ., data = train_df_backward)
  model_forward <- lm(price ~ ., data = train_df_forward)

} else {
  # Load model parameters from JSON and build models
  model_params <- fromJSON(json_filepath)

  # Create models based on the loaded features
  if (all(c("OLS_Step_Both", "OLS_Step_Backward", "OLS_Step_Forward") %in% names(model_params))) {
    # For each model type, create the model using the features stored in the JSON
    # List of features for each model from model_params
    features_both <- model_params$OLS_Step_Both
    features_backward <- model_params$OLS_Step_Backward
    features_forward <- model_params$OLS_Step_Forward

    # Create a temporary dataframe for each model with the selected features
    train_df_both <- train_df_linear[, c("price", features_both)]
    test_df_both <- test_df_linear[, c("price", features_both)]

    train_df_backward <- train_df_linear[, c("price", features_backward)]
    test_df_backward <- test_df_linear[, c("price", features_backward)]

    train_df_forward <- train_df_linear[, c("price", features_forward)]
    test_df_forward <- test_df_linear[, c("price", features_forward)]

    # Fit the linear models using the temporary dataframes
    best_linear_model <- lm(price ~ ., data = train_df_both)
    model_backward <- lm(price ~ ., data = train_df_backward)
    model_forward <- lm(price ~ ., data = train_df_forward)
  } else {
    stop("Required model parameters are missing in the JSON file.")
  }
}

# Add coefficients to dataframe
coefficients_df <- create_coefficients_df(
  model = best_linear_model,
  model_name = "Inital Step Both",
  coefficients_df = coefficients_df
)

# Add coefficients to dataframe
coefficients_df <- create_coefficients_df(
  model = model_forward,
  model_name = "Step Forward",
  coefficients_df = coefficients_df
)

# Add coefficients to dataframe
coefficients_df <- create_coefficients_df(
  model = model_backward,
  model_name = "Step Backward",
  coefficients_df = coefficients_df
)

# Evaluate OLS_Step_Both
df_results <- evaluate_model("OLS_Step_Both", best_linear_model, train_df_both, test_df_both, target_var = 'price', df_results)

# Evaluate OLS_Step_Backward
df_results <- evaluate_model("OLS_Step_Backward", model_backward, train_df_backward, test_df_backward, target_var = 'price', df_results)

# Evaluate OLS_Step_Forward
df_results <- evaluate_model("OLS_Step_Forward", model_forward, train_df_forward, test_df_forward, target_var = 'price', df_results)

# featch all the dropped features from each of the models
dropped_features_table <- get_dropped_features_table(model_backward, model_forward, best_linear_model, train_df_linear)
Table 6.1 Dropped Features
Stepwise Method Dropped Features
OLS_Step_Backward yr_built, sqft_lot15, condition_2, grade_Average, season_Summer
OLS_Step_Forward sqft_lot15, day_sold, condition_2, season_Fall
OLS_Step_Both yr_built, sqft_lot15, day_sold, zipcode_98010, zipcode_98022, zipcode_98029, zipcode_98033, zipcode_98045, zipcode_98115, zipcode_98116, zipcode_98122, zipcode_98136, zipcode_98177, condition_2, grade_Above_Average, season_Fall
# Display model results
view_model_results(df_results, generate_table_caption("Step Model Additions", section = 6))

6.2 Best Linear Model

summary(best_linear_model)
## 
## Call:
## lm(formula = price ~ ., data = train_df_both)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1416431   -64248     1685    59901  4234646 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -1.013e+08  1.030e+07  -9.832  < 2e-16 ***
## bathrooms                2.292e+04  2.798e+03   8.193 2.74e-16 ***
## sqft_living              2.115e+02  3.283e+00  64.427  < 2e-16 ***
## sqft_basement           -7.342e+01  4.120e+00 -17.822  < 2e-16 ***
## sqft_living15            3.255e+01  3.246e+00  10.026  < 2e-16 ***
## grade_Average           -3.520e+05  9.697e+03 -36.298  < 2e-16 ***
## distance_to_convergence -9.300e+03  2.496e+02 -37.260  < 2e-16 ***
## waterfront_0            -5.190e+05  1.905e+04 -27.239  < 2e-16 ***
## zipcode_98004            3.622e+05  1.176e+04  30.789  < 2e-16 ***
## zipcode_98112            3.059e+05  1.232e+04  24.828  < 2e-16 ***
## zipcode_98039            6.904e+05  2.738e+04  25.214  < 2e-16 ***
## view_4                   3.414e+05  1.341e+04  25.462  < 2e-16 ***
## bedrooms                -2.159e+04  1.730e+03 -12.480  < 2e-16 ***
## zipcode_98040            1.517e+05  1.221e+04  12.427  < 2e-16 ***
## view_3                   1.742e+05  8.979e+03  19.405  < 2e-16 ***
## zipcode_98119            1.919e+05  1.357e+04  14.147  < 2e-16 ***
## zipcode_98105            1.466e+05  1.306e+04  11.227  < 2e-16 ***
## zipcode_98109            1.983e+05  1.793e+04  11.059  < 2e-16 ***
## zipcode_98178           -2.705e+05  1.218e+04 -22.209  < 2e-16 ***
## zipcode_98102            2.024e+05  1.739e+04  11.635  < 2e-16 ***
## zipcode_98117            3.302e+04  8.681e+03   3.804 0.000143 ***
## zipcode_98103            2.928e+04  8.533e+03   3.432 0.000602 ***
## view_2                   7.160e+04  6.400e+03  11.187  < 2e-16 ***
## zipcode_98107            4.996e+04  1.195e+04   4.180 2.93e-05 ***
## zipcode_98059           -2.278e+05  9.289e+03 -24.523  < 2e-16 ***
## year_sold                5.087e+04  5.113e+03   9.950  < 2e-16 ***
## condition_5              7.211e+04  4.940e+03  14.595  < 2e-16 ***
## zipcode_98056           -2.288e+05  9.768e+03 -23.426  < 2e-16 ***
## zipcode_98058           -2.232e+05  9.472e+03 -23.563  < 2e-16 ***
## zipcode_98168           -2.283e+05  1.208e+04 -18.903  < 2e-16 ***
## zipcode_98055           -2.272e+05  1.174e+04 -19.354  < 2e-16 ***
## zipcode_98118           -1.782e+05  9.168e+03 -19.437  < 2e-16 ***
## zipcode_98108           -2.072e+05  1.369e+04 -15.135  < 2e-16 ***
## sqft_lot                 2.496e-01  3.583e-02   6.968 3.36e-12 ***
## zipcode_98106           -1.759e+05  1.054e+04 -16.688  < 2e-16 ***
## zipcode_98146           -1.885e+05  1.174e+04 -16.060  < 2e-16 ***
## yr_renovated             3.162e+01  3.213e+00   9.842  < 2e-16 ***
## zipcode_98031           -2.069e+05  1.178e+04 -17.568  < 2e-16 ***
## zipcode_98188           -2.298e+05  1.676e+04 -13.710  < 2e-16 ***
## condition_4              2.516e+04  3.164e+03   7.952 1.96e-15 ***
## view_1                   7.131e+04  1.036e+04   6.881 6.18e-12 ***
## zipcode_98166           -1.885e+05  1.187e+04 -15.883  < 2e-16 ***
## zipcode_98198           -1.984e+05  1.223e+04 -16.224  < 2e-16 ***
## zipcode_98028           -1.686e+05  1.199e+04 -14.057  < 2e-16 ***
## floors                  -3.155e+04  3.434e+03  -9.189  < 2e-16 ***
## zipcode_98042           -1.703e+05  9.138e+03 -18.633  < 2e-16 ***
## zipcode_98030           -1.787e+05  1.241e+04 -14.404  < 2e-16 ***
## zipcode_98011           -1.672e+05  1.400e+04 -11.942  < 2e-16 ***
## zipcode_98070           -1.860e+05  1.801e+04 -10.323  < 2e-16 ***
## zipcode_98032           -1.829e+05  1.741e+04 -10.508  < 2e-16 ***
## zipcode_98155           -1.328e+05  9.443e+03 -14.063  < 2e-16 ***
## zipcode_98034           -1.205e+05  8.865e+03 -13.591  < 2e-16 ***
## season_Spring            2.006e+04  3.411e+03   5.880 4.18e-09 ***
## day_of_year              1.131e+02  2.457e+01   4.602 4.22e-06 ***
## zipcode_98077           -1.495e+05  1.432e+04 -10.440  < 2e-16 ***
## zipcode_98148           -1.798e+05  2.276e+04  -7.903 2.91e-15 ***
## season_Summer            1.058e+04  3.606e+03   2.933 0.003360 ** 
## zipcode_98125           -1.101e+05  9.904e+03 -11.113  < 2e-16 ***
## zipcode_98072           -1.236e+05  1.204e+04 -10.266  < 2e-16 ***
## zipcode_98133           -1.086e+05  9.027e+03 -12.027  < 2e-16 ***
## zipcode_98074           -1.079e+05  9.617e+03 -11.215  < 2e-16 ***
## zipcode_98019           -1.343e+05  1.377e+04  -9.757  < 2e-16 ***
## zipcode_98092           -1.431e+05  1.121e+04 -12.770  < 2e-16 ***
## zipcode_98038           -1.273e+05  8.995e+03 -14.157  < 2e-16 ***
## zipcode_98023           -1.284e+05  1.023e+04 -12.550  < 2e-16 ***
## zipcode_98001           -1.229e+05  1.097e+04 -11.204  < 2e-16 ***
## zipcode_98003           -1.222e+05  1.254e+04  -9.749  < 2e-16 ***
## zipcode_98002           -1.138e+05  1.401e+04  -8.124 4.88e-16 ***
## zipcode_98065           -9.718e+04  1.142e+04  -8.511  < 2e-16 ***
## zipcode_98126           -9.718e+04  1.065e+04  -9.123  < 2e-16 ***
## zipcode_98027           -9.378e+04  9.795e+03  -9.574  < 2e-16 ***
## zipcode_98075           -9.893e+04  1.064e+04  -9.294  < 2e-16 ***
## zipcode_98053           -8.840e+04  1.030e+04  -8.584  < 2e-16 ***
## zipcode_98052           -8.713e+04  8.528e+03 -10.217  < 2e-16 ***
## zipcode_98008           -9.692e+04  1.194e+04  -8.120 5.03e-16 ***
## zipcode_98006           -7.846e+04  9.535e+03  -8.228  < 2e-16 ***
## zipcode_98144           -8.192e+04  1.072e+04  -7.642 2.27e-14 ***
## zipcode_98007           -1.053e+05  1.632e+04  -6.453 1.13e-10 ***
## zipcode_98005           -7.554e+04  1.489e+04  -5.073 3.96e-07 ***
## zipcode_98014           -7.135e+04  1.788e+04  -3.992 6.59e-05 ***
## zipcode_98024           -7.353e+04  2.154e+04  -3.414 0.000642 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 155500 on 15048 degrees of freedom
## Multiple R-squared:  0.8161, Adjusted R-squared:  0.8152 
## F-statistic:   835 on 80 and 15048 DF,  p-value: < 2.2e-16

6.3 Model Comparison

In our analysis of different regression models, we evaluated four models: OLS_linear, OLS_Step_Both, OLS_Step_Backward, and OLS_Step_Forward. These models were assessed based on several key metrics, including the sum of squared errors (SSE) for both the training and testing datasets, the coefficient of determination (R-squared) for both training and testing, root mean square error (RMSE) for training and testing, and mean absolute error (MAE) for training and testing.

The OLS_Step_Both model emerged as a compelling choice due to its unique characteristics. While it exhibited a slight reduction in model performance compared to the OLS_linear model, it showcased a distinctive feature selection process. OLS_Step_Both effectively removes a substantial number of features, making it the least complex model among the four alternatives. This feature reduction enhances model interpretability and simplicity, which can be particularly valuable in scenarios where we seek to understand the most influential variables while maintaining competitive predictive power.

The trade-off between complexity and performance makes OLS_Step_Both an attractive option for specific use cases. If the primary goal is to build a model that strikes a balance between simplicity and accuracy, the OLS_Step_Both model offers a pragmatic solution. However, after comprehensive consideration, we ultimately select the OLS_linear model as the preferred choice for our regression analysis. It delivers strong overall performance across various metrics, making it a robust and versatile model for our dataset and problem.


7. Model Assumption Checks

7.1 Linearity Assumption

7.1.1 Residuals vs Fitted Plot

The residuals vs fitted plot is our first stop to assess the assumption of linearity. The ideal scenario is a random spread of residuals around the horizontal axis, indicating a linear relationship between predictors and the response.

Figure 7.1 Residuals vs Fitted Plot for Linear Model

7.1.2 Residuals vs Leverage Plot

We use the ols_plot_resid_lev function to create a plot for detecting outliers and observations with high leverage.

Figure 7.2 OLS Outlier and Leverage Diagnostics Linear Model

7.1.3 Residuals vs Fitted Plot

We use the ols_plot_resid_stud_fit function to create a plot that helps detect non-linearity, constant variances, and outliers in residuals.

Figure 7.3 Deleted Studentized Resid. vs Pred. Linear Model

7.2 Normality of Residuals

7.2.1 Normal Q-Q Plot

The Q-Q plot offers a visual comparison of the distribution of residuals against a perfectly normal distribution. Deviations from the diagonal indicate departures from normality.

Figure 7.4 QQ Plot for Linear Model

7.3 Homoscedasticity (Constant Variance)

7.3.1 Scale-Location Plot

Also known as the spread-location plot, it’s used to check for equal variance of residuals (homoscedasticity).

Figure 7.5 Scale-Location Plot for Linear Model

7.4 Outliers and Influential Points

7.4.1 Cook’s Distance Plot

The Cook’s distance plot is instrumental in quantifying the influence of each data point.

Figure 7.6 Cooks Distance Plot for Linear Model

7.4.2 Cooks Distance with Threshold

Figure 7.7 OLS Cooks Distance Plot for Linear Model

7.4.3 Filtering Outliers

We filter the data points with Cook’s distance >= 0.04, which is a reasonable threshold for identifying influential data points according to industry standards.

# Define the threshold for Cook's distance, opting for slightly lower than .04 as it benefits the model performance to do a little bit lower
threshold <- 0.04

# Identify the indices of influential observations
influential_obs <- which(cooksd > threshold)

# Output what observations were influential prior to removal
print(influential_obs)
##  5127  9986 13244 
##  5127  9986 13244
# Remove outliers from both datasets
train_df_both <- train_df_both[-influential_obs, ]

# re-fit the linear model without outliers
best_linear_model <- lm(price ~ ., data = train_df_both)
summary(best_linear_model)
## 
## Call:
## lm(formula = price ~ ., data = train_df_both)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1359328   -64137      589    58324  2106190 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -9.686e+07  9.896e+06  -9.787  < 2e-16 ***
## bathrooms                2.180e+04  2.687e+03   8.112 5.36e-16 ***
## sqft_living              2.013e+02  3.167e+00  63.583  < 2e-16 ***
## sqft_basement           -7.199e+01  3.957e+00 -18.194  < 2e-16 ***
## sqft_living15            3.946e+01  3.123e+00  12.633  < 2e-16 ***
## grade_Average           -3.391e+05  9.321e+03 -36.386  < 2e-16 ***
## distance_to_convergence -9.304e+03  2.397e+02 -38.821  < 2e-16 ***
## waterfront_0            -5.162e+05  1.837e+04 -28.104  < 2e-16 ***
## zipcode_98004            3.563e+05  1.131e+04  31.490  < 2e-16 ***
## zipcode_98112            3.073e+05  1.183e+04  25.980  < 2e-16 ***
## zipcode_98039            6.988e+05  2.629e+04  26.578  < 2e-16 ***
## view_4                   3.298e+05  1.291e+04  25.541  < 2e-16 ***
## bedrooms                -1.939e+04  1.663e+03 -11.662  < 2e-16 ***
## zipcode_98040            1.455e+05  1.174e+04  12.386  < 2e-16 ***
## view_3                   1.666e+05  8.630e+03  19.303  < 2e-16 ***
## zipcode_98119            1.916e+05  1.303e+04  14.705  < 2e-16 ***
## zipcode_98105            1.460e+05  1.254e+04  11.640  < 2e-16 ***
## zipcode_98109            1.978e+05  1.722e+04  11.487  < 2e-16 ***
## zipcode_98178           -2.717e+05  1.170e+04 -23.230  < 2e-16 ***
## zipcode_98102            1.514e+05  1.679e+04   9.014  < 2e-16 ***
## zipcode_98117            3.255e+04  8.336e+03   3.905 9.48e-05 ***
## zipcode_98103            2.717e+04  8.193e+03   3.316 0.000914 ***
## view_2                   7.466e+04  6.146e+03  12.147  < 2e-16 ***
## zipcode_98107            4.726e+04  1.148e+04   4.118 3.84e-05 ***
## zipcode_98059           -2.278e+05  8.919e+03 -25.543  < 2e-16 ***
## year_sold                4.866e+04  4.911e+03   9.908  < 2e-16 ***
## condition_5              7.279e+04  4.744e+03  15.343  < 2e-16 ***
## zipcode_98056           -2.300e+05  9.379e+03 -24.517  < 2e-16 ***
## zipcode_98058           -2.238e+05  9.095e+03 -24.606  < 2e-16 ***
## zipcode_98168           -2.305e+05  1.160e+04 -19.879  < 2e-16 ***
## zipcode_98055           -2.289e+05  1.127e+04 -20.308  < 2e-16 ***
## zipcode_98118           -1.793e+05  8.803e+03 -20.367  < 2e-16 ***
## zipcode_98108           -2.086e+05  1.315e+04 -15.867  < 2e-16 ***
## sqft_lot                 2.718e-01  3.441e-02   7.901 2.96e-15 ***
## zipcode_98106           -1.778e+05  1.012e+04 -17.566  < 2e-16 ***
## zipcode_98146           -1.886e+05  1.127e+04 -16.735  < 2e-16 ***
## yr_renovated             2.986e+01  3.087e+00   9.675  < 2e-16 ***
## zipcode_98031           -2.081e+05  1.131e+04 -18.400  < 2e-16 ***
## zipcode_98188           -2.304e+05  1.609e+04 -14.316  < 2e-16 ***
## condition_4              2.317e+04  3.039e+03   7.624 2.60e-14 ***
## view_1                   7.444e+04  9.951e+03   7.481 7.81e-14 ***
## zipcode_98166           -1.863e+05  1.140e+04 -16.345  < 2e-16 ***
## zipcode_98198           -1.980e+05  1.174e+04 -16.866  < 2e-16 ***
## zipcode_98028           -1.690e+05  1.151e+04 -14.682  < 2e-16 ***
## floors                  -2.844e+04  3.299e+03  -8.623  < 2e-16 ***
## zipcode_98042           -1.714e+05  8.775e+03 -19.531  < 2e-16 ***
## zipcode_98030           -1.803e+05  1.191e+04 -15.131  < 2e-16 ***
## zipcode_98011           -1.681e+05  1.345e+04 -12.501  < 2e-16 ***
## zipcode_98070           -1.843e+05  1.730e+04 -10.654  < 2e-16 ***
## zipcode_98032           -1.834e+05  1.672e+04 -10.969  < 2e-16 ***
## zipcode_98155           -1.336e+05  9.067e+03 -14.731  < 2e-16 ***
## zipcode_98034           -1.210e+05  8.513e+03 -14.211  < 2e-16 ***
## season_Spring            1.924e+04  3.276e+03   5.872 4.40e-09 ***
## day_of_year              1.042e+02  2.359e+01   4.418 1.00e-05 ***
## zipcode_98077           -1.476e+05  1.375e+04 -10.737  < 2e-16 ***
## zipcode_98148           -1.814e+05  2.185e+04  -8.304  < 2e-16 ***
## season_Summer            1.060e+04  3.463e+03   3.061 0.002207 ** 
## zipcode_98125           -1.111e+05  9.510e+03 -11.683  < 2e-16 ***
## zipcode_98072           -1.233e+05  1.156e+04 -10.665  < 2e-16 ***
## zipcode_98133           -1.105e+05  8.668e+03 -12.745  < 2e-16 ***
## zipcode_98074           -1.070e+05  9.235e+03 -11.585  < 2e-16 ***
## zipcode_98019           -1.357e+05  1.322e+04 -10.269  < 2e-16 ***
## zipcode_98092           -1.437e+05  1.076e+04 -13.354  < 2e-16 ***
## zipcode_98038           -1.287e+05  8.637e+03 -14.902  < 2e-16 ***
## zipcode_98023           -1.288e+05  9.827e+03 -13.105  < 2e-16 ***
## zipcode_98001           -1.242e+05  1.053e+04 -11.790  < 2e-16 ***
## zipcode_98003           -1.228e+05  1.204e+04 -10.203  < 2e-16 ***
## zipcode_98002           -1.143e+05  1.346e+04  -8.495  < 2e-16 ***
## zipcode_98065           -9.733e+04  1.096e+04  -8.877  < 2e-16 ***
## zipcode_98126           -9.825e+04  1.023e+04  -9.606  < 2e-16 ***
## zipcode_98027           -9.325e+04  9.405e+03  -9.915  < 2e-16 ***
## zipcode_98075           -9.728e+04  1.022e+04  -9.517  < 2e-16 ***
## zipcode_98053           -8.705e+04  9.889e+03  -8.803  < 2e-16 ***
## zipcode_98052           -8.725e+04  8.189e+03 -10.654  < 2e-16 ***
## zipcode_98008           -9.622e+04  1.146e+04  -8.394  < 2e-16 ***
## zipcode_98006           -7.440e+04  9.156e+03  -8.126 4.80e-16 ***
## zipcode_98144           -8.200e+04  1.029e+04  -7.966 1.75e-15 ***
## zipcode_98007           -1.060e+05  1.567e+04  -6.767 1.36e-11 ***
## zipcode_98005           -7.296e+04  1.430e+04  -5.103 3.38e-07 ***
## zipcode_98014           -7.122e+04  1.716e+04  -4.149 3.36e-05 ***
## zipcode_98024           -7.325e+04  2.068e+04  -3.542 0.000399 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 149300 on 15045 degrees of freedom
## Multiple R-squared:  0.8218, Adjusted R-squared:  0.8208 
## F-statistic: 867.1 on 80 and 15045 DF,  p-value: < 2.2e-16
df_results <- evaluate_model("OLS_Step_Both_Outliers", best_linear_model, train_df_both, test_df_both, target_var = 'price', df_results)

# Add coefficients to dataframe
coefficients_df <- create_coefficients_df(
  model = best_linear_model,
  model_name = "OLS Both outliers",
  coefficients_df = coefficients_df
)

8. Addressing Linearity, Multicollinearity, Normality and Heteroscedasticity

8.1 Examining the Non-Linearity of the Model

Figure 8.1 Box-Cox Plot for Linear Model

As we can observe from the box cox plot above, the lambda that give the highest log-likehood is 0.0909091. We transform Y using the log transformation as shown below.

8.1.1 Transformation of Model

# Function to transform a single value back to the original scale
inverse_transform <- function(value, lambda=optimal_lambda) {
  if (lambda != 0) {
    return((value * lambda + 1)^(1 / lambda))
  } else {
    return(exp(value))
  }
}

# Function to calculate SSE, RMSE, MAE, and SST with or without inverse transformation
calculate_metrics <- function(actual, predicted, lambda = optimal_lambda) {
  if (lambda != 0) {
    # Apply inverse transformation to actual and predicted values
    actual_original <- inverse_transform(actual, lambda)
    predicted_original <- inverse_transform(predicted, lambda)

    # Calculate SSE, RMSE, MAE, and SST
    sse <- sum((actual_original - predicted_original)^2)
    rmse <- sqrt(sse / length(actual_original))
    mae <- mean(abs(actual_original - predicted_original))
    sst <- sum((actual_original - mean(actual_original))^2)
  } else {
    # Calculate SSE, RMSE, MAE, and SST without transformation
    sse <- sum((actual - predicted)^2)
    rmse <- sqrt(sse / length(actual))
    mae <- mean(abs(actual - predicted))
    sst <- sum((actual - mean(actual))^2)
  }

  return(list(sse = sse, rmse = rmse, mae = mae, sst = sst))
}

# Apply the Box-Cox transformation to the training and test data
if (optimal_lambda != 0) {
  transformed_price <- (train_df_both$price^optimal_lambda - 1) / optimal_lambda
  transformed_test_price <- (test_df_both$price^optimal_lambda - 1) / optimal_lambda
} else {
  transformed_price <- log(train_df_both$price)
  transformed_test_price <- log(test_df_both$price)
}

# Create a linear regression model using the transformed price
transformed_model <- lm(transformed_price ~ ., data = dplyr::select(train_df_both, -price))

summary(transformed_model)
## 
## Call:
## lm(formula = transformed_price ~ ., data = dplyr::select(train_df_both, 
##     -price))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4702 -0.3274  0.0294  0.3550  3.3390 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -5.602e+02  4.112e+01 -13.625  < 2e-16 ***
## bathrooms                1.680e-01  1.116e-02  15.050  < 2e-16 ***
## sqft_living              8.984e-04  1.316e-05  68.287  < 2e-16 ***
## sqft_basement           -3.453e-04  1.644e-05 -21.001  < 2e-16 ***
## sqft_living15            3.761e-04  1.298e-05  28.982  < 2e-16 ***
## grade_Average           -9.368e-02  3.873e-02  -2.419   0.0156 *  
## distance_to_convergence -5.969e-02  9.958e-04 -59.947  < 2e-16 ***
## waterfront_0            -1.508e+00  7.631e-02 -19.763  < 2e-16 ***
## zipcode_98004            7.198e-01  4.701e-02  15.313  < 2e-16 ***
## zipcode_98112            7.708e-01  4.915e-02  15.683  < 2e-16 ***
## zipcode_98039            1.247e+00  1.092e-01  11.417  < 2e-16 ***
## view_4                   1.024e+00  5.365e-02  19.096  < 2e-16 ***
## bedrooms                -3.523e-02  6.909e-03  -5.099 3.45e-07 ***
## zipcode_98040            1.054e-01  4.880e-02   2.160   0.0308 *  
## view_3                   6.726e-01  3.585e-02  18.760  < 2e-16 ***
## zipcode_98119            8.285e-01  5.413e-02  15.306  < 2e-16 ***
## zipcode_98105            4.431e-01  5.210e-02   8.505  < 2e-16 ***
## zipcode_98109            7.553e-01  7.153e-02  10.560  < 2e-16 ***
## zipcode_98178           -2.002e+00  4.859e-02 -41.209  < 2e-16 ***
## zipcode_98102            5.634e-01  6.977e-02   8.075 7.24e-16 ***
## zipcode_98117            2.955e-01  3.463e-02   8.532  < 2e-16 ***
## zipcode_98103            1.709e-01  3.404e-02   5.021 5.21e-07 ***
## view_2                   3.922e-01  2.554e-02  15.358  < 2e-16 ***
## zipcode_98107            3.184e-01  4.768e-02   6.678 2.51e-11 ***
## zipcode_98059           -1.286e+00  3.706e-02 -34.708  < 2e-16 ***
## year_sold                2.907e-01  2.040e-02  14.247  < 2e-16 ***
## condition_5              3.668e-01  1.971e-02  18.611  < 2e-16 ***
## zipcode_98056           -1.489e+00  3.897e-02 -38.218  < 2e-16 ***
## zipcode_98058           -1.493e+00  3.779e-02 -39.498  < 2e-16 ***
## zipcode_98168           -2.042e+00  4.818e-02 -42.384  < 2e-16 ***
## zipcode_98055           -1.747e+00  4.683e-02 -37.295  < 2e-16 ***
## zipcode_98118           -1.245e+00  3.657e-02 -34.049  < 2e-16 ***
## zipcode_98108           -1.427e+00  5.462e-02 -26.123  < 2e-16 ***
## sqft_lot                 2.376e-06  1.430e-07  16.621  < 2e-16 ***
## zipcode_98106           -1.422e+00  4.204e-02 -33.820  < 2e-16 ***
## zipcode_98146           -1.416e+00  4.682e-02 -30.245  < 2e-16 ***
## yr_renovated             1.358e-04  1.283e-05  10.586  < 2e-16 ***
## zipcode_98031           -1.540e+00  4.698e-02 -32.783  < 2e-16 ***
## zipcode_98188           -1.789e+00  6.687e-02 -26.752  < 2e-16 ***
## condition_4              1.544e-01  1.263e-02  12.230  < 2e-16 ***
## view_1                   3.840e-01  4.134e-02   9.287  < 2e-16 ***
## zipcode_98166           -1.037e+00  4.736e-02 -21.891  < 2e-16 ***
## zipcode_98198           -1.446e+00  4.878e-02 -29.642  < 2e-16 ***
## zipcode_98028           -9.154e-01  4.784e-02 -19.135  < 2e-16 ***
## floors                  -6.739e-02  1.371e-02  -4.917 8.89e-07 ***
## zipcode_98042           -1.298e+00  3.646e-02 -35.605  < 2e-16 ***
## zipcode_98030           -1.407e+00  4.950e-02 -28.423  < 2e-16 ***
## zipcode_98011           -8.618e-01  5.586e-02 -15.427  < 2e-16 ***
## zipcode_98070           -5.711e-01  7.188e-02  -7.945 2.08e-15 ***
## zipcode_98032           -1.569e+00  6.945e-02 -22.584  < 2e-16 ***
## zipcode_98155           -8.571e-01  3.767e-02 -22.752  < 2e-16 ***
## zipcode_98034           -7.224e-01  3.537e-02 -20.425  < 2e-16 ***
## season_Spring            1.247e-01  1.361e-02   9.162  < 2e-16 ***
## day_of_year              6.417e-04  9.803e-05   6.545 6.13e-11 ***
## zipcode_98077           -5.770e-01  5.711e-02 -10.103  < 2e-16 ***
## zipcode_98148           -1.516e+00  9.079e-02 -16.699  < 2e-16 ***
## season_Summer            8.471e-02  1.439e-02   5.887 4.01e-09 ***
## zipcode_98125           -6.053e-01  3.951e-02 -15.320  < 2e-16 ***
## zipcode_98072           -5.819e-01  4.803e-02 -12.117  < 2e-16 ***
## zipcode_98133           -7.318e-01  3.601e-02 -20.320  < 2e-16 ***
## zipcode_98074           -4.657e-01  3.837e-02 -12.138  < 2e-16 ***
## zipcode_98019           -7.618e-01  5.492e-02 -13.872  < 2e-16 ***
## zipcode_98092           -1.011e+00  4.471e-02 -22.616  < 2e-16 ***
## zipcode_98038           -8.612e-01  3.589e-02 -23.997  < 2e-16 ***
## zipcode_98023           -1.088e+00  4.083e-02 -26.660  < 2e-16 ***
## zipcode_98001           -1.144e+00  4.377e-02 -26.129  < 2e-16 ***
## zipcode_98003           -1.021e+00  5.003e-02 -20.403  < 2e-16 ***
## zipcode_98002           -1.293e+00  5.591e-02 -23.134  < 2e-16 ***
## zipcode_98065           -3.761e-01  4.555e-02  -8.256  < 2e-16 ***
## zipcode_98126           -6.307e-01  4.250e-02 -14.841  < 2e-16 ***
## zipcode_98027           -4.180e-01  3.908e-02 -10.698  < 2e-16 ***
## zipcode_98075           -4.456e-01  4.247e-02 -10.494  < 2e-16 ***
## zipcode_98053           -3.637e-01  4.109e-02  -8.852  < 2e-16 ***
## zipcode_98052           -3.983e-01  3.402e-02 -11.707  < 2e-16 ***
## zipcode_98008           -5.176e-01  4.762e-02 -10.869  < 2e-16 ***
## zipcode_98006           -5.343e-01  3.804e-02 -14.044  < 2e-16 ***
## zipcode_98144           -6.792e-01  4.277e-02 -15.881  < 2e-16 ***
## zipcode_98007           -5.803e-01  6.511e-02  -8.913  < 2e-16 ***
## zipcode_98005           -4.024e-01  5.940e-02  -6.774 1.30e-11 ***
## zipcode_98014           -4.854e-01  7.132e-02  -6.806 1.04e-11 ***
## zipcode_98024           -4.948e-01  8.593e-02  -5.759 8.65e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6204 on 15045 degrees of freedom
## Multiple R-squared:  0.8726, Adjusted R-squared:  0.8719 
## F-statistic:  1288 on 80 and 15045 DF,  p-value: < 2.2e-16
# Add coefficients to dataframe
coefficients_df <- create_coefficients_df(
  model = transformed_model,
  model_name = "Transformed Model",
  coefficients_df = coefficients_df
)

# Predict prices for training and test data using the transformed model
predicted_train_transformed <- predict(transformed_model, newdata = train_df_both)
predicted_test_transformed <- predict(transformed_model, newdata = test_df_both)

# Calculate other metrics for training and test data
metrics_train <- calculate_metrics(transformed_price, predicted_train_transformed, optimal_lambda)
metrics_test <- calculate_metrics(transformed_test_price, predicted_test_transformed, optimal_lambda)

# Calculate SSE for the training and test data without inverse transformation
sse_train <- sum((transformed_price - predicted_train_transformed)^2)
sse_test <- sum((transformed_test_price - predicted_test_transformed)^2)

# Calculate R-squared for training and test data without inverse transformation
sst_train <- sum((transformed_price - mean(transformed_price))^2)
sst_test <- sum((transformed_test_price - mean(transformed_test_price))^2)

r_squared_train <- 1 - (sse_train / sst_train)
r_squared_test <- 1 - (sse_test / sst_test)

# Create a dataframe to store the results
df_results <- rbind(df_results, data.frame(
  Model = "Transformed Model",
  SSE_train = metrics_train$sse,
  SSE_test = metrics_test$sse,
  R_squared_train = r_squared_train,
  R_squared_test = r_squared_test,
  RMSE_train = metrics_train$rmse,
  RMSE_test = metrics_test$rmse,
  MAE_train = metrics_train$mae,
  MAE_test = metrics_test$mae
))

# View the updated model results
view_model_results(df_results, caption = generate_table_caption("Transformation Model Addition", section = 8))

8.2.2 Plots after Transformation

Figure 8.2 Diagnostic Plots after Transformation

Figure 8.3 Boxplot After Transformation ### 8.3 Detection of Multicollinearity

8.3.1 VIF-Based Multicollinearity Analysis

In this section, we perform a Variance Inflation Factor (VIF)-based multicollinearity analysis. Multicollinearity refers to the situation where predictor variables in a regression model are highly correlated with each other, which can lead to instability in coefficient estimates and difficulties in interpreting the model. VIF helps us identify and mitigate multicollinearity by quantifying how much the variance of the estimated coefficients is increased due to the correlation between predictors.

vif_values <- vif(transformed_model)
vif_values
##               bathrooms             sqft_living           sqft_basement 
##                2.885919                5.591630                2.078644 
##           sqft_living15           grade_Average distance_to_convergence 
##                3.129896                1.287715                3.586258 
##            waterfront_0           zipcode_98004           zipcode_98112 
##                1.532997                1.239202                1.140794 
##           zipcode_98039                  view_4                bedrooms 
##                1.051926                1.555985                1.654887 
##           zipcode_98040                  view_3           zipcode_98119 
##                1.178869                1.116631                1.070897 
##           zipcode_98105           zipcode_98109           zipcode_98178 
##                1.095893                1.044862                1.085291 
##           zipcode_98102           zipcode_98117           zipcode_98103 
##                1.056638                1.157612                1.218270 
##                  view_2           zipcode_98107           zipcode_98059 
##                1.097466                1.096697                1.158714 
##               year_sold             condition_5           zipcode_98056 
##                3.579772                1.129758                1.133790 
##           zipcode_98058           zipcode_98168           zipcode_98055 
##                1.133719                1.084810                1.074864 
##           zipcode_98118           zipcode_98108                sqft_lot 
##                1.188458                1.075240                1.223590 
##           zipcode_98106           zipcode_98146            yr_renovated 
##                1.116055                1.085421                1.049119 
##           zipcode_98031           zipcode_98188             condition_4 
##                1.104029                1.039494                1.214259 
##                  view_1           zipcode_98166           zipcode_98198 
##                1.049111                1.087726                1.111711 
##           zipcode_98028                  floors           zipcode_98042 
##                1.075043                2.148960                1.292638 
##           zipcode_98030           zipcode_98011           zipcode_98070 
##                1.126009                1.061104                1.121497 
##           zipcode_98032           zipcode_98155           zipcode_98034 
##                1.059415                1.123381                1.173275 
##           season_Spring             day_of_year           zipcode_98077 
##                1.573156                3.389715                1.092376 
##           zipcode_98148           season_Summer           zipcode_98125 
##                1.024765                1.620946                1.122651 
##           zipcode_98072           zipcode_98133           zipcode_98074 
##                1.089445                1.139479                1.172599 
##           zipcode_98019           zipcode_98092           zipcode_98038 
##                1.079371                1.317431                1.388642 
##           zipcode_98023           zipcode_98001           zipcode_98003 
##                1.439587                1.300638                1.213702 
##           zipcode_98002           zipcode_98065           zipcode_98126 
##                1.174290                1.195131                1.108418 
##           zipcode_98027           zipcode_98075           zipcode_98053 
##                1.128614                1.183813                1.163209 
##           zipcode_98052           zipcode_98008           zipcode_98006 
##                1.225477                1.122828                1.242755 
##           zipcode_98144           zipcode_98007           zipcode_98005 
##                1.163978                1.072503                1.118598 
##           zipcode_98014           zipcode_98024 
##                1.090945                1.051387

We would typically define a VIF threshold (in this case, 10), which serves as a criterion to identify predictor variables with high VIF values, however, there are not any VIF’s above this threshold meaning we do not have issues of multicollinearity.

8.4 Detection of Heteroscedasticity

8.4.1 Breusch Pagan Test

# Breusch-Pagan Test
bp_test_results <- bptest(transformed_model, studentize = FALSE)

# Output the results of the Breusch-Pagan test
bp_test_results
## 
##  Breusch-Pagan test
## 
## data:  transformed_model
## BP = 2482.9, df = 80, p-value < 2.2e-16

In our analysis, we conducted a Breusch-Pagan test to check for heteroscedasticity in our linear regression model, referred to as “transformed_model.” Heteroscedasticity refers to the situation where the spread of errors (residuals) in the model varies across different levels of the independent variables.

The test results showed a very low p-value (p < 2.2e-16), indicating strong evidence against the null hypothesis suggesting that heteroscedasticity is present in the residuals of our model. This finding is essential as it implies that the variability of errors is not consistent across all levels of the predictor variables. Addressing heteroscedasticity may require adjustments to improve the model’s reliability and interpretability.

8.4.2 Examining Scale Location Plot for Heteroscedasticity

Figure 8.4 Plots Examining Heteroscedasticity

8.5 Remedial Measures for Heteroscedasticity

8.5.1 Weighted Least Squares

# Calculate absolute residuals from the transformed model
transformed_model_residuals <- abs(transformed_model$residuals)

# Fit a linear model on the absolute residuals of the transformed model
residuals_model <- lm(transformed_model_residuals ~ ., data = train_df_both[, -1])

# Retrieve fitted values from the residuals model
fitted_values_residuals <- residuals_model$fitted.values

# Calculate weights as the inverse square of the fitted values
weights_iteration_1 <- 1 / (fitted_values_residuals ^ 2)

# Refit the transformed model with the new weights
transformed_model_weighted <- lm(transformed_price ~ ., weights = weights_iteration_1, data = train_df_both[, -1])

# Add coefficients to check if heteroscedascticity is resolved later
coefficients_df <- create_coefficients_df(
  model = transformed_model_weighted,
  model_name = "WT Model",
  coefficients_df = coefficients_df
)

Figure 8.5 Plots Post WLS on Transformed Model

Figure 8.6 Boxplot for WLS on Transformed Model

# Perform Breusch-Pagan test to check for heteroscedasticity
bptest_result <- bptest(transformed_model_weighted, studentize = TRUE)
bptest_result
## 
##  studentized Breusch-Pagan test
## 
## data:  transformed_model_weighted
## BP = 72140, df = 80, p-value < 2.2e-16
# Iterate to improve the model based on Breusch-Pagan test results
iteration <- 1

#TODO change back to 200
while (iteration < 200 && bptest_result$p.value < 0.05) {
  # Calculate new absolute residuals
  abs_residuals_new <- abs(transformed_model_weighted$residuals)

  # Fit a new linear model on the new absolute residuals
  residuals_model_new <- lm(abs_residuals_new ~ ., data = train_df_both[, -1])

  # Retrieve new fitted values

  fitted_values_residuals_new <- residuals_model_new$fitted.values

  # Calculate new weights
  weights_new <- 1 / (fitted_values_residuals_new ^ 2)

  # Refit the transformed model with the new weights
  transformed_model_weighted <- lm(transformed_price ~ ., weights = weights_new, data = train_df_both[,-1])

  # Update the Breusch-Pagan test result
  bptest_result <- bptest(transformed_model_weighted, studentize = TRUE)

  # Increment iteration counter
  iteration <- iteration + 1
}

# Extract coefficients from the final weighted least squares model
coefficients_wls_final <- transformed_model_weighted$coefficients

# Summary and diagnostic plots of the final model
summary(transformed_model_weighted)
## 
## Call:
## lm(formula = transformed_price ~ ., data = train_df_both[, -1], 
##     weights = weights_new)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2503  -0.7468   0.0648   0.8272   9.2738 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -4.862e+02  3.668e+01 -13.256  < 2e-16 ***
## bathrooms                1.577e-01  1.020e-02  15.454  < 2e-16 ***
## sqft_living              8.413e-04  1.229e-05  68.454  < 2e-16 ***
## sqft_basement           -3.401e-04  1.570e-05 -21.657  < 2e-16 ***
## sqft_living15            3.792e-04  1.234e-05  30.731  < 2e-16 ***
## grade_Average           -1.353e-01  5.005e-02  -2.703 0.006878 ** 
## distance_to_convergence -5.831e-02  9.368e-04 -62.249  < 2e-16 ***
## waterfront_0            -1.555e+00  1.098e-01 -14.166  < 2e-16 ***
## zipcode_98004            7.300e-01  4.945e-02  14.764  < 2e-16 ***
## zipcode_98112            8.216e-01  6.475e-02  12.690  < 2e-16 ***
## zipcode_98039            1.291e+00  9.424e-02  13.703  < 2e-16 ***
## view_4                   1.066e+00  7.725e-02  13.798  < 2e-16 ***
## bedrooms                 1.449e-02  1.685e-03   8.601  < 2e-16 ***
## zipcode_98040            1.818e-01  4.894e-02   3.714 0.000205 ***
## view_3                   6.488e-01  4.759e-02  13.631  < 2e-16 ***
## zipcode_98119            8.092e-01  6.280e-02  12.885  < 2e-16 ***
## zipcode_98105            4.254e-01  4.995e-02   8.517  < 2e-16 ***
## zipcode_98109            7.409e-01  8.934e-02   8.292  < 2e-16 ***
## zipcode_98178           -2.010e+00  4.512e-02 -44.547  < 2e-16 ***
## zipcode_98102            5.496e-01  8.151e-02   6.743 1.61e-11 ***
## zipcode_98117            3.015e-01  3.389e-02   8.898  < 2e-16 ***
## zipcode_98103            2.025e-01  4.090e-02   4.952 7.44e-07 ***
## view_2                   4.007e-01  2.751e-02  14.564  < 2e-16 ***
## zipcode_98107            2.991e-01  3.558e-02   8.408  < 2e-16 ***
## zipcode_98059           -1.290e+00  3.055e-02 -42.239  < 2e-16 ***
## year_sold                2.540e-01  1.820e-02  13.953  < 2e-16 ***
## condition_5              3.451e-01  1.712e-02  20.156  < 2e-16 ***
## zipcode_98056           -1.486e+00  4.084e-02 -36.385  < 2e-16 ***
## zipcode_98058           -1.493e+00  3.314e-02 -45.068  < 2e-16 ***
## zipcode_98168           -2.059e+00  5.034e-02 -40.899  < 2e-16 ***
## zipcode_98055           -1.741e+00  4.893e-02 -35.582  < 2e-16 ***
## zipcode_98118           -1.261e+00  5.570e-02 -22.648  < 2e-16 ***
## zipcode_98108           -1.418e+00  5.562e-02 -25.487  < 2e-16 ***
## sqft_lot                 3.939e-06  2.516e-07  15.661  < 2e-16 ***
## zipcode_98106           -1.419e+00  4.703e-02 -30.163  < 2e-16 ***
## zipcode_98146           -1.450e+00  7.206e-02 -20.117  < 2e-16 ***
## yr_renovated             1.261e-04  1.536e-05   8.212 2.34e-16 ***
## zipcode_98031           -1.546e+00  2.948e-02 -52.429  < 2e-16 ***
## zipcode_98188           -1.799e+00  5.823e-02 -30.890  < 2e-16 ***
## condition_4              1.192e-01  1.104e-02  10.796  < 2e-16 ***
## view_1                   3.773e-01  4.425e-02   8.528  < 2e-16 ***
## zipcode_98166           -1.056e+00  5.523e-02 -19.119  < 2e-16 ***
## zipcode_98198           -1.465e+00  5.295e-02 -27.662  < 2e-16 ***
## zipcode_98028           -9.005e-01  3.832e-02 -23.497  < 2e-16 ***
## floors                  -5.829e-02  1.243e-02  -4.688 2.78e-06 ***
## zipcode_98042           -1.338e+00  3.075e-02 -43.524  < 2e-16 ***
## zipcode_98030           -1.427e+00  3.790e-02 -37.641  < 2e-16 ***
## zipcode_98011           -8.192e-01  4.130e-02 -19.837  < 2e-16 ***
## zipcode_98070           -6.307e-01  8.484e-02  -7.434 1.11e-13 ***
## zipcode_98032           -1.554e+00  5.740e-02 -27.080  < 2e-16 ***
## zipcode_98155           -8.536e-01  3.121e-02 -27.347  < 2e-16 ***
## zipcode_98034           -7.467e-01  2.887e-02 -25.864  < 2e-16 ***
## season_Spring            1.163e-01  1.236e-02   9.405  < 2e-16 ***
## day_of_year              5.172e-04  8.696e-05   5.948 2.78e-09 ***
## zipcode_98077           -5.485e-01  4.478e-02 -12.251  < 2e-16 ***
## zipcode_98148           -1.520e+00  1.019e-01 -14.914  < 2e-16 ***
## season_Summer            6.804e-02  1.245e-02   5.467 4.66e-08 ***
## zipcode_98125           -6.058e-01  3.751e-02 -16.151  < 2e-16 ***
## zipcode_98072           -6.006e-01  3.746e-02 -16.033  < 2e-16 ***
## zipcode_98133           -7.459e-01  3.209e-02 -23.241  < 2e-16 ***
## zipcode_98074           -4.131e-01  2.996e-02 -13.787  < 2e-16 ***
## zipcode_98019           -7.875e-01  3.936e-02 -20.007  < 2e-16 ***
## zipcode_98092           -1.076e+00  3.604e-02 -29.861  < 2e-16 ***
## zipcode_98038           -8.934e-01  2.684e-02 -33.283  < 2e-16 ***
## zipcode_98023           -1.087e+00  3.145e-02 -34.554  < 2e-16 ***
## zipcode_98001           -1.190e+00  3.768e-02 -31.574  < 2e-16 ***
## zipcode_98003           -1.025e+00  4.106e-02 -24.963  < 2e-16 ***
## zipcode_98002           -1.310e+00  4.271e-02 -30.666  < 2e-16 ***
## zipcode_98065           -3.225e-01  3.172e-02 -10.167  < 2e-16 ***
## zipcode_98126           -6.220e-01  4.538e-02 -13.707  < 2e-16 ***
## zipcode_98027           -3.802e-01  3.566e-02 -10.660  < 2e-16 ***
## zipcode_98075           -3.493e-01  3.182e-02 -10.979  < 2e-16 ***
## zipcode_98053           -2.948e-01  4.019e-02  -7.336 2.31e-13 ***
## zipcode_98052           -3.722e-01  2.578e-02 -14.436  < 2e-16 ***
## zipcode_98008           -5.119e-01  3.825e-02 -13.383  < 2e-16 ***
## zipcode_98006           -4.746e-01  3.752e-02 -12.649  < 2e-16 ***
## zipcode_98144           -6.900e-01  4.715e-02 -14.636  < 2e-16 ***
## zipcode_98007           -5.602e-01  4.714e-02 -11.885  < 2e-16 ***
## zipcode_98005           -3.290e-01  5.246e-02  -6.272 3.67e-10 ***
## zipcode_98014           -3.658e-01  1.019e-01  -3.589 0.000333 ***
## zipcode_98024           -5.215e-01  7.779e-02  -6.704 2.10e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.331 on 15045 degrees of freedom
## Multiple R-squared:  0.8798, Adjusted R-squared:  0.8792 
## F-statistic:  1376 on 80 and 15045 DF,  p-value: < 2.2e-16
# Evaluate model performance on the training dataset
predicted_train <- predict(transformed_model_weighted, newdata = train_df_both)
MAE_train <- mae(train_df_both$price, predicted_train)
SSE_train <- sum((train_df_both$price - predicted_train) ^ 2)
R_squared_train <- R2(train_df_both$price, predicted_train)
RMSE_train <- rmse(train_df_both$price, predicted_train)

# Evaluate model performance on the test dataset
predicted_test <- predict(transformed_model_weighted, newdata = test_df_both)
MAE_test <- mae(test_df_both$price, predicted_test)
SSE_test <- sum((test_df_both$price - predicted_test) ^ 2)
R_squared_test <- R2(test_df_both$price, predicted_test)
RMSE_test <- rmse(test_df_both$price, predicted_test)

# Add the performance metrics of the transformed model to the df_results dataframe
df_results <- rbind(df_results, data.frame(
  Model = "WT Model 200",
  SSE_train = SSE_train,
  SSE_test = SSE_test,
  R_squared_train = R_squared_train,
  R_squared_test = R_squared_test,
  RMSE_train = RMSE_train,
  RMSE_test = RMSE_test,
  MAE_train = MAE_train,
  MAE_test = MAE_test
))

summary(transformed_model_weighted)
## 
## Call:
## lm(formula = transformed_price ~ ., data = train_df_both[, -1], 
##     weights = weights_new)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2503  -0.7468   0.0648   0.8272   9.2738 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -4.862e+02  3.668e+01 -13.256  < 2e-16 ***
## bathrooms                1.577e-01  1.020e-02  15.454  < 2e-16 ***
## sqft_living              8.413e-04  1.229e-05  68.454  < 2e-16 ***
## sqft_basement           -3.401e-04  1.570e-05 -21.657  < 2e-16 ***
## sqft_living15            3.792e-04  1.234e-05  30.731  < 2e-16 ***
## grade_Average           -1.353e-01  5.005e-02  -2.703 0.006878 ** 
## distance_to_convergence -5.831e-02  9.368e-04 -62.249  < 2e-16 ***
## waterfront_0            -1.555e+00  1.098e-01 -14.166  < 2e-16 ***
## zipcode_98004            7.300e-01  4.945e-02  14.764  < 2e-16 ***
## zipcode_98112            8.216e-01  6.475e-02  12.690  < 2e-16 ***
## zipcode_98039            1.291e+00  9.424e-02  13.703  < 2e-16 ***
## view_4                   1.066e+00  7.725e-02  13.798  < 2e-16 ***
## bedrooms                 1.449e-02  1.685e-03   8.601  < 2e-16 ***
## zipcode_98040            1.818e-01  4.894e-02   3.714 0.000205 ***
## view_3                   6.488e-01  4.759e-02  13.631  < 2e-16 ***
## zipcode_98119            8.092e-01  6.280e-02  12.885  < 2e-16 ***
## zipcode_98105            4.254e-01  4.995e-02   8.517  < 2e-16 ***
## zipcode_98109            7.409e-01  8.934e-02   8.292  < 2e-16 ***
## zipcode_98178           -2.010e+00  4.512e-02 -44.547  < 2e-16 ***
## zipcode_98102            5.496e-01  8.151e-02   6.743 1.61e-11 ***
## zipcode_98117            3.015e-01  3.389e-02   8.898  < 2e-16 ***
## zipcode_98103            2.025e-01  4.090e-02   4.952 7.44e-07 ***
## view_2                   4.007e-01  2.751e-02  14.564  < 2e-16 ***
## zipcode_98107            2.991e-01  3.558e-02   8.408  < 2e-16 ***
## zipcode_98059           -1.290e+00  3.055e-02 -42.239  < 2e-16 ***
## year_sold                2.540e-01  1.820e-02  13.953  < 2e-16 ***
## condition_5              3.451e-01  1.712e-02  20.156  < 2e-16 ***
## zipcode_98056           -1.486e+00  4.084e-02 -36.385  < 2e-16 ***
## zipcode_98058           -1.493e+00  3.314e-02 -45.068  < 2e-16 ***
## zipcode_98168           -2.059e+00  5.034e-02 -40.899  < 2e-16 ***
## zipcode_98055           -1.741e+00  4.893e-02 -35.582  < 2e-16 ***
## zipcode_98118           -1.261e+00  5.570e-02 -22.648  < 2e-16 ***
## zipcode_98108           -1.418e+00  5.562e-02 -25.487  < 2e-16 ***
## sqft_lot                 3.939e-06  2.516e-07  15.661  < 2e-16 ***
## zipcode_98106           -1.419e+00  4.703e-02 -30.163  < 2e-16 ***
## zipcode_98146           -1.450e+00  7.206e-02 -20.117  < 2e-16 ***
## yr_renovated             1.261e-04  1.536e-05   8.212 2.34e-16 ***
## zipcode_98031           -1.546e+00  2.948e-02 -52.429  < 2e-16 ***
## zipcode_98188           -1.799e+00  5.823e-02 -30.890  < 2e-16 ***
## condition_4              1.192e-01  1.104e-02  10.796  < 2e-16 ***
## view_1                   3.773e-01  4.425e-02   8.528  < 2e-16 ***
## zipcode_98166           -1.056e+00  5.523e-02 -19.119  < 2e-16 ***
## zipcode_98198           -1.465e+00  5.295e-02 -27.662  < 2e-16 ***
## zipcode_98028           -9.005e-01  3.832e-02 -23.497  < 2e-16 ***
## floors                  -5.829e-02  1.243e-02  -4.688 2.78e-06 ***
## zipcode_98042           -1.338e+00  3.075e-02 -43.524  < 2e-16 ***
## zipcode_98030           -1.427e+00  3.790e-02 -37.641  < 2e-16 ***
## zipcode_98011           -8.192e-01  4.130e-02 -19.837  < 2e-16 ***
## zipcode_98070           -6.307e-01  8.484e-02  -7.434 1.11e-13 ***
## zipcode_98032           -1.554e+00  5.740e-02 -27.080  < 2e-16 ***
## zipcode_98155           -8.536e-01  3.121e-02 -27.347  < 2e-16 ***
## zipcode_98034           -7.467e-01  2.887e-02 -25.864  < 2e-16 ***
## season_Spring            1.163e-01  1.236e-02   9.405  < 2e-16 ***
## day_of_year              5.172e-04  8.696e-05   5.948 2.78e-09 ***
## zipcode_98077           -5.485e-01  4.478e-02 -12.251  < 2e-16 ***
## zipcode_98148           -1.520e+00  1.019e-01 -14.914  < 2e-16 ***
## season_Summer            6.804e-02  1.245e-02   5.467 4.66e-08 ***
## zipcode_98125           -6.058e-01  3.751e-02 -16.151  < 2e-16 ***
## zipcode_98072           -6.006e-01  3.746e-02 -16.033  < 2e-16 ***
## zipcode_98133           -7.459e-01  3.209e-02 -23.241  < 2e-16 ***
## zipcode_98074           -4.131e-01  2.996e-02 -13.787  < 2e-16 ***
## zipcode_98019           -7.875e-01  3.936e-02 -20.007  < 2e-16 ***
## zipcode_98092           -1.076e+00  3.604e-02 -29.861  < 2e-16 ***
## zipcode_98038           -8.934e-01  2.684e-02 -33.283  < 2e-16 ***
## zipcode_98023           -1.087e+00  3.145e-02 -34.554  < 2e-16 ***
## zipcode_98001           -1.190e+00  3.768e-02 -31.574  < 2e-16 ***
## zipcode_98003           -1.025e+00  4.106e-02 -24.963  < 2e-16 ***
## zipcode_98002           -1.310e+00  4.271e-02 -30.666  < 2e-16 ***
## zipcode_98065           -3.225e-01  3.172e-02 -10.167  < 2e-16 ***
## zipcode_98126           -6.220e-01  4.538e-02 -13.707  < 2e-16 ***
## zipcode_98027           -3.802e-01  3.566e-02 -10.660  < 2e-16 ***
## zipcode_98075           -3.493e-01  3.182e-02 -10.979  < 2e-16 ***
## zipcode_98053           -2.948e-01  4.019e-02  -7.336 2.31e-13 ***
## zipcode_98052           -3.722e-01  2.578e-02 -14.436  < 2e-16 ***
## zipcode_98008           -5.119e-01  3.825e-02 -13.383  < 2e-16 ***
## zipcode_98006           -4.746e-01  3.752e-02 -12.649  < 2e-16 ***
## zipcode_98144           -6.900e-01  4.715e-02 -14.636  < 2e-16 ***
## zipcode_98007           -5.602e-01  4.714e-02 -11.885  < 2e-16 ***
## zipcode_98005           -3.290e-01  5.246e-02  -6.272 3.67e-10 ***
## zipcode_98014           -3.658e-01  1.019e-01  -3.589 0.000333 ***
## zipcode_98024           -5.215e-01  7.779e-02  -6.704 2.10e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.331 on 15045 degrees of freedom
## Multiple R-squared:  0.8798, Adjusted R-squared:  0.8792 
## F-statistic:  1376 on 80 and 15045 DF,  p-value: < 2.2e-16
# Add coefficients to dataframe
coefficients_df <- create_coefficients_df(
  model = transformed_model_weighted,
  model_name = "WT Model 200",
  coefficients_df = coefficients_df
)

# View the updated model results
view_model_results(df_results, caption = generate_table_caption("Weighted Transformation Model Addition", section = 8))

8.5.2 Weighted Least Squares Results Post 200 Iterations

bptest(transformed_model_weighted, studentize = TRUE)
## 
##  studentized Breusch-Pagan test
## 
## data:  transformed_model_weighted
## BP = 76641, df = 80, p-value < 2.2e-16

After performing the initial Breusch-Pagan test to assess heteroscedasticity on the weighted linear model (transformed_model_weighted), the test yielded a statistic of 71397 with 80 degrees of freedom, resulting in a p-value less than 2.2e-16, indicating significant heteroscedasticity in the model. This initial result indicated that the residuals of the model exhibited a pattern that violated the assumption of constant variance.

To address this issue, 200 iterations were carried out, where the model was repeatedly re-fitted and the Breusch-Pagan test was performed after each iteration. After these iterations, the Breusch-Pagan test resulted in a significantly larger statistic of 976748 with the same 80 degrees of freedom, and a p-value still less than 2.2e-16. Despite the numerous iterations, the p-value remained extremely low, indicating that heteroscedasticity persisted in the model.

Even after 200 iterations of weighting the linear model in an attempt to address heteroscedasticity, the model continued to exhibit significant heteroscedasticity, as indicated by the persistently low p-value from the Breusch-Pagan test. Further investigation or alternative modeling approaches may be necessary to effectively address this issue.

Figure 8.7 Plots Post WLS on Transformed Model after 200 Iterations

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

Figure 8.8 Boxplot for WLS on Transformed Model after 200 Iterations

# Fetch the Weighted coefficients to compare results
model_names_to_match <- c("WT Model", "WT Model 200")

# Use the subset function to select rows where Model_Name matches any of the model names in the list
selected_rows <- subset(coefficients_df, Model_Name %in% model_names_to_match)

# Identify columns with NA values in the selected subset of rows
cols_with_na <- colnames(selected_rows)[apply(selected_rows, 2, anyNA)]

# Remove columns with NA values from the selected subset
selected_rows <- selected_rows[, !colnames(selected_rows) %in% cols_with_na]

8.6 Ridge Regression

x <- data.matrix(dplyr::select(train_df_both, -price))
y <- log(train_df_both$price)

# Problem 8.5: Run Ridge Regression
set.seed(1023)
RidgeMod <- glmnet(x, y, alpha = 0, nlambda = 100, lambda.min.ratio = 0.0001)
CvRidgeMod <- cv.glmnet(x, y, alpha = 0, nlambda = 100, lambda.min.ratio = 0.0001)

# Find the best lambda
best.lambda.ridge <- CvRidgeMod$lambda.min
print(paste0("Best Lambda: ", best.lambda.ridge))
## [1] "Best Lambda: 0.0363370475980558"
# Add coefficients to dataframe
coefficients_df <- create_coefficients_df(
  model = coefficients(RidgeMod,s=best.lambda.ridge),
  model_name = "Ridge Model",
  coefficients_df = coefficients_df
)
# Fetch the coefficients to compare results
model_names_to_match <- c("Ridge Model", "WT Model 200")

# Use the subset function to select rows where Model_Name matches any of the model names in the list
selected_rows <- subset(coefficients_df, Model_Name %in% model_names_to_match)

# Identify columns with NA values in the selected subset of rows
cols_with_na <- colnames(selected_rows)[apply(selected_rows, 2, anyNA)]

# Remove columns with NA values from the selected subset
selected_rows <- selected_rows[, !colnames(selected_rows) %in% cols_with_na]

Some Explaination about the Ridge regression here

Figure 8.9 Ridge Regression Plot

8.7 Lasso Regression

LassoMod <- glmnet(x, y, alpha = 1, nlambda = 100, lambda.min.ratio = 0.0001)
CvLassoMod <- cv.glmnet(x, y, alpha = 1, nlambda = 100, lambda.min.ratio = 0.0001)

# Find the best lambda for Lasso
best.lambda.lasso <- CvLassoMod$lambda.min
print(paste0("Best Lambda for Lasso: ", best.lambda.lasso))
## [1] "Best Lambda for Lasso: 0.000176692585065216"
# Add coefficients to dataframe for Lasso
coefficients_df <- create_coefficients_df(
  model = coefficients(LassoMod, s = best.lambda.lasso),
  model_name = "Lasso Model",
  coefficients_df = coefficients_df
)
# Fetch the coefficients to compare results between Lasso and other models
model_names_to_match <- c("Lasso Model", "WT Model 200")

# Use the subset function to select rows where Model_Name matches any of the model names in the list
selected_rows <- subset(coefficients_df, Model_Name %in% model_names_to_match)

# Identify columns with NA values in the selected subset of rows
cols_with_na <- colnames(selected_rows)[apply(selected_rows, 2, anyNA)]

# Remove columns with NA values from the selected subset
selected_rows <- selected_rows[, !colnames(selected_rows) %in% cols_with_na]

Figure 8.10 Lasso Regression Plot

8.8 Elastic Net Regression

# Set alpha for Elastic Net
alpha_elastic_net <- 0.5

# Problem 8.5: Run Elastic Net Regression
ElasticNetMod <- glmnet(x, y, alpha = alpha_elastic_net, nlambda = 100, lambda.min.ratio = 0.0001)
CvElasticNetMod <- cv.glmnet(x, y, alpha = alpha_elastic_net, nlambda = 100, lambda.min.ratio = 0.0001)

# Find the best lambda for Elastic Net
best.lambda.elastic_net <- CvElasticNetMod$lambda.min
print(paste0("Best Lambda for Elastic Net: ", best.lambda.elastic_net))
## [1] "Best Lambda for Elastic Net: 0.000321991405586386"
# Add coefficients to dataframe for Elastic Net
coefficients_df <- create_coefficients_df(
  model = coefficients(ElasticNetMod, s = best.lambda.elastic_net),
  model_name = "EN Model",
  coefficients_df = coefficients_df
)
# Fetch the coefficients to compare results between Elastic Net and other models
model_names_to_match <- c("EN Model", "WT Model 200")

# Use the subset function to select rows where Model_Name matches any of the model names in the list
selected_rows <- subset(coefficients_df, Model_Name %in% model_names_to_match)

# Identify columns with NA values in the selected subset of rows
cols_with_na <- colnames(selected_rows)[apply(selected_rows, 2, anyNA)]

# Remove columns with NA values from the selected subset
selected_rows <- selected_rows[, !colnames(selected_rows) %in% cols_with_na]

Figure 8.11 Elastic Net Regression Plot

8.9 Huber Robust Regression

# Run Huber Regression using rlm
HuberMod <- rlm(log(price) ~ ., data = train_df_both)

# Add coefficients to dataframe for Huber
coefficients_df <- create_coefficients_df(
  model = HuberMod,
  model_name = "Huber Model",
  coefficients_df = coefficients_df
)
# Fetch the coefficients to compare results between Huber and other models
model_names_to_match <- c("Huber Model", "WT Model 200")

# Use the subset function to select rows where Model_Name matches any of the model names in the list
selected_rows <- subset(coefficients_df, Model_Name %in% model_names_to_match)

# Identify columns with NA values in the selected subset of rows
cols_with_na <- colnames(selected_rows)[apply(selected_rows, 2, anyNA)]

# Remove columns with NA values from the selected subset
selected_rows <- selected_rows[, !colnames(selected_rows) %in% cols_with_na]

8.10 Coefficients Examination

10.1 Coefficient Comparison Function

This function is designed to provide a comprehensive comparison of regression model coefficients. It specifically targets scenarios where multiple models have been fit to the same dataset, and there’s a need to understand the variability or consistency of these models in terms of their coefficients. This comparison is crucial in areas like feature importance analysis, model stability assessment, and in addressing multicollinearity, normality, and other statistical aspects.

10.1.1 Function Overview

The calculate_total_coefficient_differences_for_all function iterates over each model in a given dataframe, calculates the sum of absolute differences in coefficients between that model and a predefined list of comparison models, and then adds these sums as a new column in the dataframe. This approach allows for a direct, quantitative comparison of how each model’s coefficients differ from those of other specified models.

10.1.2 Detailed Utility of Function
  • Model Stability Analysis: It provides insights into the stability of coefficients across different models. Large differences might indicate model instability or sensitivity to certain features.

  • Feature Importance Consistency: By comparing coefficient magnitudes across models, it can help assess the consistency of feature importance rankings.

  • Addressing Statistical Issues: It aids in evaluating how different models react to multicollinearity and other statistical concerns, especially if the coefficients vary significantly between models.

  • Model Selection Guidance: The function can guide model selection by highlighting models with similar or drastically different behavior in terms of their coefficients.

# List of comparison models
comparison_model_names <- c("Huber Model", "Ridge Model", "Lasso Model", "EN Model")

# Function to calculate the total sum of absolute differences in coefficients for each model
calculate_total_coefficient_differences_for_all <- function(coefficients_df, comparison_models) {
  # Get all model names
  all_model_names <- coefficients_df$Model_Name

  # Initialize a named numeric vector to store the sum of differences for each model
  total_diffs <- setNames(numeric(length(all_model_names)), all_model_names)

  # Iterate over each model in coefficients_df
  for (target_model in all_model_names) {
    target_coeffs <- coefficients_df[coefficients_df$Model_Name == target_model, -1]
    total_diff <- 0  # Initialize total difference for this model

    # Calculate sum of absolute differences with each comparison model
    for (model in comparison_models) {
      comparison_coeffs <- coefficients_df[coefficients_df$Model_Name == model, -1]
      common_features <- intersect(names(target_coeffs), names(comparison_coeffs))
      abs_diff <- abs(target_coeffs[common_features] - comparison_coeffs[common_features])
      total_diff <- total_diff + sum(abs_diff, na.rm = TRUE)
    }

    total_diffs[target_model] <- total_diff
  }

  return(total_diffs)
}

# Calculate the total differences for all models
model_diffs <- calculate_total_coefficient_differences_for_all(coefficients_df, comparison_model_names)

# Add the calculated differences as a new column in the coefficients_df
coefficients_df$total_diff <- model_diffs[coefficients_df$Model_Name]

# Move the new column to the second position
coefficients_df <- coefficients_df[, c("Model_Name", "total_diff", setdiff(names(coefficients_df), c("Model_Name", "total_diff")))]

8.11 Detailed Coefficient Analysis

The results compare various regression models based on the total sum of absolute differences in their coefficients when compared to a set of reference models: “Huber Model,” “Ridge Model,” “Lasso Model,” and “EN Model.” This analysis helps determine which model is most consistent with the reference models in terms of coefficient values, which can be indicative of their effectiveness in addressing diagnostic issues like multicollinearity, outliers, and model stability. Let’s break down these results for a detailed analysis:

8.11.1 “Initial OLS Model”

With a total difference of approximately 319 million, this model shows a substantial deviation in coefficients compared to the reference models. This significant difference suggests that the “Initial OLS Model” may have issues with multicollinearity or outliers that the reference models have addressed more effectively.

8.11.2 “OLS w/o corr” and Other OLS Variants

Similar to the “Initial OLS Model,” these OLS-based models (“Inital Step Both,” “Step Forward,” “Step Backward,” “OLS Both outliers”) have very high total differences (ranging from around 321 million to 470 million). This indicates that these models, despite their variations, have not substantially improved in aligning their coefficients with the reference models. They likely still struggle with issues like multicollinearity or outlier sensitivity.

8.11.3 “Transformed Model” and “WT Model”

These models show a drastic reduction in total differences (1824 and 1536, respectively), indicating a significant improvement over the basic OLS models. The transformations or adjustments made in these models appear to have aligned their coefficients more closely with the reference models, suggesting better handling of diagnostic issues.

8.11.4 “WT Model 200”

After 200 iterations of Weighted Least Squares (WLS) regression, this model shows a remarkable improvement with a total difference of only 78. This suggests that iterative re-weighting has significantly enhanced the model’s alignment with the reference models, potentially indicating effective handling of heteroscedasticity and outliers.

8.11.5 “Ridge Model,” “Lasso Model,” “EN Model,” and “Huber Model”

These models have the lowest total differences (ranging from 51 to 139), indicating a very close alignment with each other. This consistency suggests that these models are likely very effective in addressing multicollinearity and outliers. Their regularization techniques (Ridge, Lasso, Elastic Net) or robust regression approach (Huber) seem to stabilize the coefficient estimates substantially.


9. Alternative Models

9.1 Regression Tree Model

# Load and prepare data for this section
train_df_logis <- train_df_non_linear
test_df_logis <- test_df_non_linear


# Create data frames for regression tree analysis
# Remove 'price_cat' and keep 'price' in the datasets
train_df_reg <- train_df_logis[, !names(train_df_logis) %in% "price_cat"]
test_df_reg <- test_df_logis[, !names(test_df_logis) %in% "price_cat"]

# Define a broader range of hyperparameters to search
tuneGrid <- expand.grid(
  cp = seq(0.001, 0.1, by = 0.001)
)

# Create a custom tuning grid for rpart
customControl <- trainControl(
  method = "cv",
  number = 5,
  search = "grid",
  verboseIter = FALSE
)

# Perform hyperparameter tuning with custom control
model <- train(
  price ~ .,
  data = train_df_reg,
  method = "rpart",
  trControl = customControl,
  tuneGrid = tuneGrid
)

# Make predictions
p.rpart_train <- predict(model, newdata = train_df_reg)
p.rpart_test <- predict(model, newdata = test_df_reg)

# Calculate metrics for training dataset
MAE_train <- mae(train_df_reg$price, p.rpart_train)
SSE_train <- sum((train_df_reg$price - p.rpart_train)^2)
R_squared_train <- R2(train_df_reg$price, p.rpart_train)
RMSE_train <- rmse(train_df_reg$price, p.rpart_train)

# Calculate metrics for testing dataset
MAE_test <- mae(test_df_reg$price, p.rpart_test)
SSE_test <- sum((test_df_reg$price - p.rpart_test)^2)
R_squared_test <- R2(test_df_reg$price, p.rpart_test)
RMSE_test <- rmse(test_df_reg$price, p.rpart_test)

# Append the results to df_results
df_results <- rbind(df_results, data.frame(
  Model = "Regression Tree",
  SSE_train = SSE_train,
  SSE_test = SSE_test,
  R_squared_train = R_squared_train,
  R_squared_test = R_squared_test,
  RMSE_train = RMSE_train,
  RMSE_test = RMSE_test,
  MAE_train = MAE_train,
  MAE_test = MAE_test,
  stringsAsFactors = FALSE
))

# View Model Results
view_model_results(df_results, generate_table_caption("Regression Tree Model Addition", section = 9))

9.2 Neural Network Model

# Define a normalization function
normalize_data <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

# Function to prepare the dataset
prepare_dataset <- function(df) {
  df$waterfront <- as.numeric(df$waterfront_1)
  selected_columns <- setdiff(names(df), c("season", "price_cat", "waterfront_0", "waterfront_1"))
  df_normalized <- as.data.frame(lapply(df[selected_columns], normalize_data))
  return(df_normalized)
}

# Function to rescale normalized data back to original scale
rescale_to_original <- function(x, min_val, max_val) {
  return((x * (max_val - min_val)) + min_val)
}

# Define model builder
build_model <- function(hp) {
  model <- keras_model_sequential() %>%
    layer_dense(units = hp$Int('units_1', min_value = 32, max_value = 128, step = 32),
                activation = 'relu', input_shape = ncol(train_data) - 1) %>%
    layer_dense(units = hp$Int('units_2', min_value = 16, max_value = 64, step = 16),
                activation = 'relu') %>%
    # Add more layers if needed
    layer_dense(units = 1)

  model %>% compile(
    optimizer = optimizer_adam(hp$Float('learning_rate', 1e-4, 1e-2, sampling = 'log')),
    loss = 'mean_squared_error',
    metrics = c('mean_absolute_error')
  )
  return(model)
}

train_and_evaluate_keras_nn <- function(train_data, test_data, original_train_price, original_test_price) {
  # Initialize variables to store the best model and its lowest validation MAE
  best_model <- NULL
  lowest_val_mae <- Inf
  best_batch_size <- NULL  # Store the best batch size
  best_epochs <- NULL  # Store the best number of epochs
  # Smaller batch sizes drastically increase computation required
  batch_sizes <- c(50, 60, 70, 80, 90, 100, 110, 120, 130)  # Different batch sizes to experiment with
  epochs_list <- c(50, 100, 150, 200, 250)  # Different numbers of epochs to experiment with

  for (batch_size in batch_sizes) {
    for (epochs in epochs_list) {
      # Define a complex Keras model with multiple layers
      model <- keras_model_sequential() %>%
        layer_dense(units = 256, activation = 'relu', input_shape = ncol(train_data) - 1) %>%
        layer_dropout(rate = 0.3) %>% # Accounts for multicollinearity and prevents overfit
        layer_dense(units = 128, activation = 'relu') %>%
        layer_dropout(rate = 0.3) %>%
        layer_dense(units = 64, activation = 'relu') %>%
        layer_dense(units = 32, activation = 'relu') %>%
        layer_dense(units = 16, activation = 'relu') %>%
        layer_dense(units = 8, activation = 'relu') %>%
        layer_dense(units = 4, activation = 'relu') %>%
        layer_dense(units = 2, activation = 'relu') %>%
        layer_dense(units = 1)

      # Compile the model
      model %>% compile(
        loss = 'mean_squared_error',
        optimizer = 'adam',
        metrics = c('mean_absolute_error')
      )

      # Fit the model to training data
      history <- model %>% fit(
        x = as.matrix(train_data[, -which(names(train_data) == "price")]),
        y = train_data$price,
        epochs = epochs,
        batch_size = batch_size,
        validation_split = 0.2  # Use 20% of training data for validation
      )

      # Check the performance on the validation set
      val_mae <- min(history$metrics$val_mean_absolute_error)
      if (val_mae < lowest_val_mae) {
        lowest_val_mae <- val_mae
        best_model <- model
        best_batch_size <- batch_size
        best_epochs <- epochs
      }
    }
  }

  # Print the best hyperparameters and their correlation
  cat("Best Batch Size:", best_batch_size, "\n")
  cat("Best Number of Epochs:", best_epochs, "\n")

  # Make predictions on the training and test data
  train_predictions <- predict(best_model, x = as.matrix(train_data[, -which(names(train_data) == "price")]))
  test_predictions <- predict(best_model, x = as.matrix(test_data[, -which(names(test_data) == "price")]))

  # Rescale predictions back to original scale
  rescaled_train_predictions <- rescale_to_original(train_predictions, min(original_train_price), max(original_train_price))
  rescaled_test_predictions <- rescale_to_original(test_predictions, min(original_test_price), max(original_test_price))

  # Calculate various performance metrics
  metrics <- list(
    mae_train = mean(abs(rescaled_train_predictions - original_train_price)),
    mae_test = mean(abs(rescaled_test_predictions - original_test_price)),
    RMSE_train = sqrt(mean((rescaled_train_predictions - original_train_price)^2)),
    RMSE_test = sqrt(mean((rescaled_test_predictions - original_test_price)^2)),
    SSE_train = sum((rescaled_train_predictions - original_train_price)^2),
    SSE_test = sum((rescaled_test_predictions - original_test_price)^2),
    correlation_train = cor(rescaled_train_predictions, original_train_price),
    correlation_test = cor(rescaled_test_predictions, original_test_price),
    r_squared_train = cor(rescaled_train_predictions, original_train_price)^2,
    r_squared_test = cor(rescaled_test_predictions, original_test_price)^2
  )

  print(paste("Model Correlation:", metrics$correlation_test))

  return(metrics)
}
# Define the training and test datasets
train_df_logis <- train_df_non_linear
test_df_logis <- test_df_non_linear

# Combine train and test data for normalization
combined_df <- rbind(train_df_logis, test_df_logis)

# Normalize the combined data
combined_norm <- prepare_dataset(combined_df)

# Split the normalized data back into train and test sets
train_norm <- combined_norm[seq_len(nrow(train_df_logis)), ]
test_norm <- combined_norm[(nrow(train_df_logis) + 1):nrow(combined_norm), ]

# Original non-normalized price data
original_train_price <- train_df_non_linear$price
original_test_price <- test_df_non_linear$price

# Train and evaluate the model
if (update_model_parameters) {
  library(keras)
  library(reticulate)
  # Import python as the backend for using Keras
  use_python(python = "C:\\Users\\Charl\\AppData\\Local\\Programs\\Python\\Python39\\python.exe", required = TRUE)
  # Fetch metrics
  nn_metrics <- train_and_evaluate_keras_nn(train_norm, test_norm, original_train_price, original_test_price)
  # Code to update model parameters in JSON
  update_model_json("NeuralNetwork", nn_metrics, json_filepath)
} else {
  # Code to load model parameters from JSON
  model_params <- fromJSON(json_filepath)
  nn_metrics <- model_params$NeuralNetwork
}

# Prepare results for the dataframe
nn_model_row <- data.frame(
  Model = "Neural Network",
  SSE_train = nn_metrics$SSE_train,
  SSE_test = nn_metrics$SSE_test,
  R_squared_train = nn_metrics$r_squared_train,
  R_squared_test = nn_metrics$r_squared_test,
  RMSE_train = nn_metrics$RMSE_train,
  RMSE_test = nn_metrics$RMSE_test,
  MAE_train = nn_metrics$mae_train,
  MAE_test = nn_metrics$mae_test,
  stringsAsFactors = FALSE
)

# Append the new row to df_results
df_results <- rbind(df_results, nn_model_row)
view_model_results(df_results, generate_table_caption("Neural Network Model Addition", section = 9))

9.3 Logistic Regression

9.3.1 Data Processing

# Define the training and test datasets
train_df_logis <- train_df_non_linear
test_df_logis <- test_df_non_linear

# Convert "price" to binary class labels based on median
train_df_logis$price <- ifelse(train_df_logis$price > median(train_df_logis$price), 1, 0)
test_df_logis$price <- ifelse(test_df_logis$price > median(test_df_logis$price), 1, 0)

# Convert "price" to factor with labels "low" and "high"
train_df_logis$price <- factor(train_df_logis$price, levels = c("0", "1"), labels = c("low", "high"))
test_df_logis$price <- factor(test_df_logis$price, levels = c("0", "1"), labels = c("low", "high"))

In this section, we preprocess the data for binary logistic regression. We load the necessary libraries and define the training and test datasets. The “price” column is transformed into binary class labels based on the median, and it is converted to a factor with labels “low” and “high.”

9.3.2 Data Structure and Summary

# Display structure and head of the datasets
str(train_df_logis)
## 'data.frame':    15129 obs. of  102 variables:
##  $ price                  : Factor w/ 2 levels "low","high": 2 1 2 1 1 2 1 2 2 1 ...
##  $ bedrooms               : int  5 3 4 4 3 3 4 4 4 3 ...
##  $ bathrooms              : num  1 1.75 2.75 2.5 1.75 2.5 2.75 3.5 2.5 2.25 ...
##  $ sqft_living            : int  2590 1960 2810 2000 1660 2180 2420 3490 2310 2140 ...
##  $ sqft_lot               : int  4652 7705 10300 13300 6250 4533 8438 87497 7555 7200 ...
##  $ floors                 : num  2 1 1 1 1 2 2 2 2 2 ...
##  $ sqft_basement          : int  280 980 1000 800 830 0 0 0 0 0 ...
##  $ yr_built               : int  1907 1950 1978 1968 1980 2010 1997 2001 1997 1979 ...
##  $ yr_renovated           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sqft_living15          : int  2360 1380 2710 1800 1660 2180 2270 2400 1980 1890 ...
##  $ sqft_lot15             : int  4650 4349 9900 9810 5750 7347 8770 55657 8416 7455 ...
##  $ year_sold              : num  2014 2014 2014 2014 2014 ...
##  $ day_sold               : int  9 23 8 9 9 17 9 26 25 10 ...
##  $ day_of_year            : num  282 266 220 129 190 321 68 238 56 253 ...
##  $ zipcode_98001          : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ zipcode_98002          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98003          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98004          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98005          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98006          : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ zipcode_98007          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98008          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98010          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98011          : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ zipcode_98014          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98019          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98022          : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ zipcode_98023          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98024          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98027          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98028          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98029          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98030          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98031          : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ zipcode_98032          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98033          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98034          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98038          : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ zipcode_98039          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98040          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98042          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98045          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98052          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98053          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98055          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98056          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98058          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98059          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98065          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98070          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98072          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98074          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98075          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98077          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98092          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98102          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98103          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98105          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98106          : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98107          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98108          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98109          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98112          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98115          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98116          : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ zipcode_98117          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98118          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98119          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98122          : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98125          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98126          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98133          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98136          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98144          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98146          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98148          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98155          : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ zipcode_98166          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98168          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98177          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98178          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98188          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98198          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98199          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view_0                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ view_1                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view_2                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view_3                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view_4                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition_1            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition_2            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition_4            : int  1 1 1 1 0 0 0 0 0 1 ...
##  $ condition_5            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ grade_Above_Average    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ grade_Average          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ waterfront_0           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ waterfront_1           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ season_Winter          : int  0 0 0 0 0 0 1 0 1 0 ...
##  $ season_Spring          : int  0 0 0 1 0 0 0 0 0 0 ...
##   [list output truncated]
str(test_df_logis)
## 'data.frame':    6484 obs. of  102 variables:
##  $ price                  : Factor w/ 2 levels "low","high": 1 2 2 1 1 1 1 2 1 1 ...
##  $ bedrooms               : int  2 3 4 3 3 3 3 4 3 3 ...
##  $ bathrooms              : num  1 2 4.5 2.25 1.5 1 1 3 2 1 ...
##  $ sqft_living            : int  770 1680 5420 1715 1060 1780 1430 2950 1890 1250 ...
##  $ sqft_lot               : int  10000 8080 101930 6819 9711 7470 19901 5000 14040 9774 ...
##  $ floors                 : num  1 1 1 2 1 1 1.5 2 2 1 ...
##  $ sqft_basement          : int  0 0 1530 0 0 730 0 970 0 0 ...
##  $ yr_built               : int  1933 1987 2001 1995 1963 1960 1927 1979 1994 1969 ...
##  $ yr_renovated           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sqft_living15          : int  2720 1800 4760 2238 1650 1780 1780 2140 1890 1280 ...
##  $ sqft_lot15             : int  8062 7503 101930 6819 9711 8113 12697 4000 14018 8850 ...
##  $ year_sold              : num  2015 2015 2014 2014 2015 ...
##  $ day_sold               : int  25 18 12 27 15 15 28 24 31 24 ...
##  $ day_of_year            : num  56 49 132 178 15 105 148 24 212 114 ...
##  $ zipcode_98001          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98002          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98003          : int  0 0 0 1 0 0 0 0 0 1 ...
##  $ zipcode_98004          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98005          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98006          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98007          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98008          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98010          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98011          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98014          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98019          : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ zipcode_98022          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98023          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98024          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98027          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98028          : int  1 0 0 0 0 0 1 0 0 0 ...
##  $ zipcode_98029          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98030          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98031          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98032          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98033          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98034          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98038          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98039          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98040          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98042          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98045          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98052          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98053          : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ zipcode_98055          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98056          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98058          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98059          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98065          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98070          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98072          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98074          : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98075          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98077          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98092          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98102          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98103          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98105          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98106          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98107          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98108          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98109          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98112          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98115          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98116          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98117          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98118          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98119          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98122          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98125          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98126          : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ zipcode_98133          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98136          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98144          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98146          : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ zipcode_98148          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98155          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98166          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98168          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98177          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98178          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98188          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98198          : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ zipcode_98199          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view_0                 : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ view_1                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view_2                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view_3                 : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ view_4                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition_1            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition_2            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition_4            : int  0 0 0 0 0 0 1 0 0 1 ...
##  $ condition_5            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ grade_Above_Average    : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ grade_Average          : int  1 1 0 1 1 1 1 1 1 1 ...
##  $ waterfront_0           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ waterfront_1           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ season_Winter          : int  1 1 0 0 1 0 0 1 0 0 ...
##  $ season_Spring          : int  0 0 1 1 0 1 1 0 0 1 ...
##   [list output truncated]
head(train_df_logis)
##   price bedrooms bathrooms sqft_living sqft_lot floors sqft_basement yr_built
## 1  high        5      1.00        2590     4652      2           280     1907
## 2   low        3      1.75        1960     7705      1           980     1950
## 3  high        4      2.75        2810    10300      1          1000     1978
## 4   low        4      2.50        2000    13300      1           800     1968
## 5   low        3      1.75        1660     6250      1           830     1980
## 6  high        3      2.50        2180     4533      2             0     2010
##   yr_renovated sqft_living15 sqft_lot15 year_sold day_sold day_of_year
## 1            0          2360       4650      2014        9         282
## 2            0          1380       4349      2014       23         266
## 3            0          2710       9900      2014        8         220
## 4            0          1800       9810      2014        9         129
## 5            0          1660       5750      2014        9         190
## 6            0          2180       7347      2014       17         321
##   zipcode_98001 zipcode_98002 zipcode_98003 zipcode_98004 zipcode_98005
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             1             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98006 zipcode_98007 zipcode_98008 zipcode_98010 zipcode_98011
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             1             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             1
##   zipcode_98014 zipcode_98019 zipcode_98022 zipcode_98023 zipcode_98024
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98027 zipcode_98028 zipcode_98029 zipcode_98030 zipcode_98031
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98032 zipcode_98033 zipcode_98034 zipcode_98038 zipcode_98039
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98040 zipcode_98042 zipcode_98045 zipcode_98052 zipcode_98053
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98055 zipcode_98056 zipcode_98058 zipcode_98059 zipcode_98065
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98070 zipcode_98072 zipcode_98074 zipcode_98075 zipcode_98077
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98092 zipcode_98102 zipcode_98103 zipcode_98105 zipcode_98106
## 1             0             0             0             0             0
## 2             0             0             0             0             1
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98107 zipcode_98108 zipcode_98109 zipcode_98112 zipcode_98115
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98116 zipcode_98117 zipcode_98118 zipcode_98119 zipcode_98122
## 1             0             0             0             0             1
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             1             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98125 zipcode_98126 zipcode_98133 zipcode_98136 zipcode_98144
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98146 zipcode_98148 zipcode_98155 zipcode_98166 zipcode_98168
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98177 zipcode_98178 zipcode_98188 zipcode_98198 zipcode_98199 view_0
## 1             0             0             0             0             0      1
## 2             0             0             0             0             0      1
## 3             0             0             0             0             0      1
## 4             0             0             0             0             0      1
## 5             0             0             0             0             0      1
## 6             0             0             0             0             0      1
##   view_1 view_2 view_3 view_4 condition_1 condition_2 condition_4 condition_5
## 1      0      0      0      0           0           0           1           0
## 2      0      0      0      0           0           0           1           0
## 3      0      0      0      0           0           0           1           0
## 4      0      0      0      0           0           0           1           0
## 5      0      0      0      0           0           0           0           0
## 6      0      0      0      0           0           0           0           0
##   grade_Above_Average grade_Average waterfront_0 waterfront_1 season_Winter
## 1                   0             1            1            0             0
## 2                   0             1            1            0             0
## 3                   0             1            1            0             0
## 4                   0             1            1            0             0
## 5                   0             1            1            0             0
## 6                   0             1            1            0             0
##   season_Spring season_Summer season_Fall distance_to_convergence
## 1             0             0           1                5.094525
## 2             0             1           0               13.117306
## 3             0             1           0                8.516639
## 4             1             0           0               29.761115
## 5             0             1           0               12.202504
## 6             0             0           1               15.268135
head(test_df_logis)
##   price bedrooms bathrooms sqft_living sqft_lot floors sqft_basement yr_built
## 1   low        2      1.00         770    10000      1             0     1933
## 2  high        3      2.00        1680     8080      1             0     1987
## 3  high        4      4.50        5420   101930      1          1530     2001
## 4   low        3      2.25        1715     6819      2             0     1995
## 5   low        3      1.50        1060     9711      1             0     1963
## 6   low        3      1.00        1780     7470      1           730     1960
##   yr_renovated sqft_living15 sqft_lot15 year_sold day_sold day_of_year
## 1            0          2720       8062      2015       25          56
## 2            0          1800       7503      2015       18          49
## 3            0          4760     101930      2014       12         132
## 4            0          2238       6819      2014       27         178
## 5            0          1650       9711      2015       15          15
## 6            0          1780       8113      2015       15         105
##   zipcode_98001 zipcode_98002 zipcode_98003 zipcode_98004 zipcode_98005
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             1             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98006 zipcode_98007 zipcode_98008 zipcode_98010 zipcode_98011
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98014 zipcode_98019 zipcode_98022 zipcode_98023 zipcode_98024
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98027 zipcode_98028 zipcode_98029 zipcode_98030 zipcode_98031
## 1             0             1             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98032 zipcode_98033 zipcode_98034 zipcode_98038 zipcode_98039
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98040 zipcode_98042 zipcode_98045 zipcode_98052 zipcode_98053
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             1
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98055 zipcode_98056 zipcode_98058 zipcode_98059 zipcode_98065
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98070 zipcode_98072 zipcode_98074 zipcode_98075 zipcode_98077
## 1             0             0             0             0             0
## 2             0             0             1             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98092 zipcode_98102 zipcode_98103 zipcode_98105 zipcode_98106
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98107 zipcode_98108 zipcode_98109 zipcode_98112 zipcode_98115
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98116 zipcode_98117 zipcode_98118 zipcode_98119 zipcode_98122
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98125 zipcode_98126 zipcode_98133 zipcode_98136 zipcode_98144
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             0             0             0             0             0
##   zipcode_98146 zipcode_98148 zipcode_98155 zipcode_98166 zipcode_98168
## 1             0             0             0             0             0
## 2             0             0             0             0             0
## 3             0             0             0             0             0
## 4             0             0             0             0             0
## 5             0             0             0             0             0
## 6             1             0             0             0             0
##   zipcode_98177 zipcode_98178 zipcode_98188 zipcode_98198 zipcode_98199 view_0
## 1             0             0             0             0             0      1
## 2             0             0             0             0             0      1
## 3             0             0             0             0             0      1
## 4             0             0             0             0             0      1
## 5             0             0             0             1             0      1
## 6             0             0             0             0             0      1
##   view_1 view_2 view_3 view_4 condition_1 condition_2 condition_4 condition_5
## 1      0      0      0      0           0           0           0           0
## 2      0      0      0      0           0           0           0           0
## 3      0      0      0      0           0           0           0           0
## 4      0      0      0      0           0           0           0           0
## 5      0      0      0      0           0           0           0           0
## 6      0      0      0      0           0           0           0           0
##   grade_Above_Average grade_Average waterfront_0 waterfront_1 season_Winter
## 1                   0             1            1            0             1
## 2                   0             1            1            0             1
## 3                   1             0            1            0             0
## 4                   0             1            1            0             0
## 5                   0             1            1            0             1
## 6                   0             1            1            0             0
##   season_Spring season_Summer season_Fall distance_to_convergence
## 1             0             0           0                13.44572
## 2             0             0           0                13.78622
## 3             1             0           0                17.33194
## 4             1             0           0                34.95768
## 5             0             0           0                23.96256
## 6             1             0           0                14.18864
# Display unique values of "price" in both datasets
unique(train_df_logis$price)
## [1] high low 
## Levels: low high
unique(test_df_logis$price)
## [1] low  high
## Levels: low high

In this section, we examine the structure and summary of the datasets. We display the structure, head, and unique values of the “price” column in both the training and test datasets.

9.3.3 Data Visualization

Figure 9.1 Histogram of Train Price Data

Figure 9.2 Historgram of Test Price Data

This section includes histograms of the price column in both datasets to visualize the distribution of class labels.

9.3.4 Binary Logistic Regression

# Fit a binary logistic regression model
fit_logistic <- glm(price ~ ., family = binomial, data = train_df_logis)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(fit_logistic)
## 
## Call:
## glm(formula = price ~ ., family = binomial, data = train_df_logis)
## 
## Coefficients: (4 not defined because of singularities)
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -1.714e+03  2.167e+03  -0.791 0.428948    
## bedrooms                -1.190e-01  4.282e-02  -2.778 0.005465 ** 
## bathrooms                3.815e-01  7.874e-02   4.844 1.27e-06 ***
## sqft_living              3.337e-03  1.119e-04  29.823  < 2e-16 ***
## sqft_lot                 1.149e-05  1.657e-06   6.933 4.13e-12 ***
## floors                  -4.045e-01  8.958e-02  -4.515 6.33e-06 ***
## sqft_basement           -1.335e-03  1.182e-04 -11.291  < 2e-16 ***
## yr_built                -9.408e-05  1.797e-03  -0.052 0.958239    
## yr_renovated             2.762e-04  8.855e-05   3.119 0.001817 ** 
## sqft_living15            1.438e-03  9.876e-05  14.558  < 2e-16 ***
## sqft_lot15               3.727e-06  2.188e-06   1.703 0.088486 .  
## year_sold                8.493e-01  1.274e-01   6.666 2.64e-11 ***
## day_sold                -5.903e-03  3.753e-03  -1.573 0.115804    
## day_of_year              6.365e-04  1.408e-03   0.452 0.651299    
## zipcode_98001           -3.153e+00  5.825e-01  -5.412 6.23e-08 ***
## zipcode_98002           -1.634e+01  2.629e+02  -0.062 0.950437    
## zipcode_98003           -2.995e+00  6.300e-01  -4.754 1.99e-06 ***
## zipcode_98004           -7.863e-01  1.077e+00  -0.730 0.465285    
## zipcode_98005           -1.406e+00  1.098e+00  -1.281 0.200211    
## zipcode_98006           -3.538e+00  3.730e-01  -9.488  < 2e-16 ***
## zipcode_98007           -3.580e+00  4.198e-01  -8.529  < 2e-16 ***
## zipcode_98008           -2.979e+00  3.586e-01  -8.307  < 2e-16 ***
## zipcode_98010           -4.836e-01  7.038e-01  -0.687 0.492017    
## zipcode_98011           -3.362e+00  3.671e-01  -9.158  < 2e-16 ***
## zipcode_98014           -2.227e+00  5.951e-01  -3.742 0.000182 ***
## zipcode_98019           -3.136e+00  4.318e-01  -7.262 3.82e-13 ***
## zipcode_98022           -7.031e-01  8.330e-01  -0.844 0.398669    
## zipcode_98023           -4.360e+00  6.638e-01  -6.568 5.11e-11 ***
## zipcode_98024           -1.410e+00  6.249e-01  -2.257 0.023993 *  
## zipcode_98027           -1.964e+00  3.408e-01  -5.763 8.27e-09 ***
## zipcode_98028           -4.204e+00  3.394e-01 -12.386  < 2e-16 ***
## zipcode_98029           -3.026e-01  3.589e-01  -0.843 0.399200    
## zipcode_98030           -4.899e+00  5.685e-01  -8.618  < 2e-16 ***
## zipcode_98031           -6.802e+00  6.296e-01 -10.804  < 2e-16 ***
## zipcode_98032           -8.554e+00  1.461e+00  -5.855 4.77e-09 ***
## zipcode_98033           -2.415e+00  3.531e-01  -6.840 7.94e-12 ***
## zipcode_98034           -3.864e+00  2.993e-01 -12.908  < 2e-16 ***
## zipcode_98038           -3.879e+00  4.841e-01  -8.013 1.12e-15 ***
## zipcode_98039            9.476e+00  4.539e+02   0.021 0.983346    
## zipcode_98040            1.006e+01  2.020e+02   0.050 0.960270    
## zipcode_98042           -5.264e+00  4.720e-01 -11.152  < 2e-16 ***
## zipcode_98045            1.696e+00  6.255e-01   2.712 0.006687 ** 
## zipcode_98052           -2.584e+00  3.175e-01  -8.138 4.02e-16 ***
## zipcode_98053           -1.411e+00  3.586e-01  -3.933 8.38e-05 ***
## zipcode_98055           -6.617e+00  4.918e-01 -13.454  < 2e-16 ***
## zipcode_98056           -6.212e+00  3.382e-01 -18.368  < 2e-16 ***
## zipcode_98058           -6.105e+00  3.937e-01 -15.506  < 2e-16 ***
## zipcode_98059           -5.730e+00  3.358e-01 -17.065  < 2e-16 ***
## zipcode_98065           -1.106e+00  4.529e-01  -2.442 0.014614 *  
## zipcode_98070           -1.386e+00  5.433e-01  -2.552 0.010704 *  
## zipcode_98072           -2.808e+00  3.519e-01  -7.980 1.46e-15 ***
## zipcode_98074           -1.943e+00  3.528e-01  -5.506 3.67e-08 ***
## zipcode_98075           -1.458e+00  5.429e-01  -2.686 0.007235 ** 
## zipcode_98077           -2.956e+00  4.718e-01  -6.265 3.73e-10 ***
## zipcode_98092           -4.109e+00  5.829e-01  -7.050 1.80e-12 ***
## zipcode_98102           -8.387e-01  4.927e-01  -1.702 0.088671 .  
## zipcode_98103           -1.031e+00  2.855e-01  -3.610 0.000306 ***
## zipcode_98105            5.085e-02  4.941e-01   0.103 0.918035    
## zipcode_98106           -6.773e+00  5.305e-01 -12.766  < 2e-16 ***
## zipcode_98107            3.602e-01  3.260e-01   1.105 0.269089    
## zipcode_98108           -5.859e+00  3.958e-01 -14.804  < 2e-16 ***
## zipcode_98109            3.024e-01  5.824e-01   0.519 0.603531    
## zipcode_98112           -7.640e-01  5.338e-01  -1.431 0.152389    
## zipcode_98115           -1.253e+00  2.979e-01  -4.206 2.60e-05 ***
## zipcode_98116           -6.535e-01  3.147e-01  -2.076 0.037855 *  
## zipcode_98117            6.791e-02  2.892e-01   0.235 0.814342    
## zipcode_98118           -4.961e+00  3.192e-01 -15.543  < 2e-16 ***
## zipcode_98119            8.679e-01  5.192e-01   1.672 0.094577 .  
## zipcode_98122           -2.522e+00  3.506e-01  -7.193 6.33e-13 ***
## zipcode_98125           -3.232e+00  3.038e-01 -10.639  < 2e-16 ***
## zipcode_98126           -3.039e+00  3.307e-01  -9.192  < 2e-16 ***
## zipcode_98133           -3.396e+00  3.135e-01 -10.835  < 2e-16 ***
## zipcode_98136           -1.516e+00  3.337e-01  -4.542 5.56e-06 ***
## zipcode_98144           -4.482e+00  3.479e-01 -12.881  < 2e-16 ***
## zipcode_98146           -4.119e+00  3.811e-01 -10.809  < 2e-16 ***
## zipcode_98148           -5.725e+00  1.042e+00  -5.494 3.92e-08 ***
## zipcode_98155           -3.976e+00  3.271e-01 -12.154  < 2e-16 ***
## zipcode_98166           -3.728e+00  3.905e-01  -9.547  < 2e-16 ***
## zipcode_98168           -8.689e+00  1.285e+00  -6.764 1.34e-11 ***
## zipcode_98177           -1.912e+00  3.499e-01  -5.465 4.63e-08 ***
## zipcode_98178           -8.215e+00  6.046e-01 -13.587  < 2e-16 ***
## zipcode_98188           -6.176e+00  7.245e-01  -8.525  < 2e-16 ***
## zipcode_98198           -4.896e+00  5.516e-01  -8.876  < 2e-16 ***
## zipcode_98199                   NA         NA      NA       NA    
## view_0                  -3.397e+00  6.665e-01  -5.097 3.45e-07 ***
## view_1                  -2.343e+00  7.209e-01  -3.250 0.001152 ** 
## view_2                  -2.249e+00  6.769e-01  -3.323 0.000892 ***
## view_3                  -1.538e+00  6.973e-01  -2.205 0.027452 *  
## view_4                          NA         NA      NA       NA    
## condition_1             -2.611e+00  7.649e-01  -3.414 0.000640 ***
## condition_2             -1.369e+00  4.601e-01  -2.976 0.002921 ** 
## condition_4              4.537e-01  8.318e-02   5.455 4.91e-08 ***
## condition_5              1.035e+00  1.330e-01   7.780 7.27e-15 ***
## grade_Above_Average      9.196e+00  2.152e+03   0.004 0.996590    
## grade_Average            8.048e+00  2.152e+03   0.004 0.997016    
## waterfront_0            -2.883e+00  7.246e-01  -3.979 6.93e-05 ***
## waterfront_1                    NA         NA      NA       NA    
## season_Winter           -1.352e-01  3.472e-01  -0.389 0.696955    
## season_Spring            4.149e-01  2.490e-01   1.666 0.095626 .  
## season_Summer            2.566e-01  1.548e-01   1.658 0.097260 .  
## season_Fall                     NA         NA      NA       NA    
## distance_to_convergence -2.697e-01  2.011e-02 -13.411  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20973.2  on 15128  degrees of freedom
## Residual deviance:  6773.4  on 15031  degrees of freedom
## AIC: 6969.4
## 
## Number of Fisher Scoring iterations: 16

Here, we fit a binary logistic regression model using the glm function. We specify the family parameter as “binomial” for logistic regression and provide the training data. We then display a summary of the fitted model.

9.3.5 Deviance and Hypothesis Testing

# Compute the null and residual deviances
summary_fit_logistic <- summary(fit_logistic)
null_deviance <- summary_fit_logistic$null.deviance
residual_deviance <- summary_fit_logistic$deviance
df_null <- summary_fit_logistic$df.null
df_residual <- summary_fit_logistic$df.residual

# Calculate the difference in deviances and degrees of freedom
diff_deviances <- null_deviance - residual_deviance
diff_df <- df_null - df_residual

# Perform a chi-squared test to assess variable importance
p_value <- 1 - pchisq(diff_deviances, diff_df)

# Determine if variables are important based on p-value
if (p_value < 0.05) {
  conclusion <- "Variables are important (Reject Null Hypothesis)"
} else {
  conclusion <- "Variables are not important (Fail to Reject Null Hypothesis)"
}

In this subsection, we compute the null and residual deviance’s of the logistic regression model. We then calculate the difference in deviance’s and degrees of freedom to perform a chi-squared test to assess the importance of variables. The result is used to determine whether variables are important or not.

9.3.6 Stepwise Selection with AIC

# Function to perform and evaluate stepwise logistic regression
perform_and_evaluate_logistic_regression <- function(train_df, test_df, fit_logistic_step) {
  # Model evaluation on training data
  true_labels_train <- ifelse(train_df$price == "high", 1, 0)
  predicted_probabilities_train <- predict(fit_logistic_step, train_df, type = "response")
  predictions_train <- ifelse(predicted_probabilities_train > 0.5, 1, 0)
  confusion_matrix_train <- caret::confusionMatrix(as.factor(predictions_train), as.factor(true_labels_train))

  # Accuracy and other metrics from confusion matrix for training data
  accuracy_train <- confusion_matrix_train$overall['Accuracy']
  precision_train <- confusion_matrix_train$byClass['Precision']
  recall_train <- confusion_matrix_train$byClass['Recall']
  F1_train <- 2 * (precision_train * recall_train) / (precision_train + recall_train)

  # Pseudo R-squared for training data
  pseudo_r_squared_train <- 1 - fit_logistic_step$deviance / fit_logistic_step$null.deviance

  # Model evaluation on test data
  true_labels_test <- ifelse(test_df$price == "high", 1, 0)
  predicted_probabilities_test <- predict(fit_logistic_step, test_df, type = "response")
  predictions_test <- ifelse(predicted_probabilities_test > 0.5, 1, 0)
  confusion_matrix_test <- caret::confusionMatrix(as.factor(predictions_test), as.factor(true_labels_test))

  # Accuracy and other metrics from confusion matrix for test data
  accuracy_test <- confusion_matrix_test$overall['Accuracy']
  precision_test <- confusion_matrix_test$byClass['Precision']
  recall_test <- confusion_matrix_test$byClass['Recall']
  F1_test <- 2 * (precision_test * recall_test) / (precision_test + recall_test)

  # Pseudo R-squared for test data
  pseudo_r_squared_test <- 1 - fit_logistic_step$deviance / fit_logistic_step$null.deviance

  # Extracting feature names from the model formula
  feature_names <- all.vars(formula(fit_logistic_step))

  # Remove the response variable name ('price') from the list
  feature_names <- setdiff(feature_names, "price")

  # Store all results in a list
  results <- list(
    features = feature_names,
    coefficients = coef(fit_logistic_step),
    accuracy_train = accuracy_train,
    precision_train = precision_train,
    recall_train = recall_train,
    F1_train = F1_train,
    pseudo_r_squared_train = pseudo_r_squared_train,
    accuracy_test = accuracy_test,
    precision_test = precision_test,
    recall_test = recall_test,
    F1_test = F1_test
  )

  return(results)
}

if (update_model_parameters) {
  # Perform and evaluate logistic regression, then store results
  # Step fit the model
  fit_logistic_step <- step(fit_logistic, trace = 0)

  logistic_results <- perform_and_evaluate_logistic_regression(train_df_logis, test_df_logis, fit_logistic_step)

  # Prepare a list of logistic regression parameters and metrics for JSON update
  logistic_json_data <- list(
    features = logistic_results$features,
    coefficients = logistic_results$coefficients,
    accuracy_train = logistic_results$accuracy_train,
    precision_train = logistic_results$precision_train,
    recall_train = logistic_results$recall_train,
    F1_train = logistic_results$F1_train,
    pseudo_r_squared_train = logistic_results$pseudo_r_squared_train,
    accuracy_test = logistic_results$accuracy_test,
    precision_test = logistic_results$precision_test,
    recall_test = logistic_results$recall_test,
    F1_test = logistic_results$F1_test
  )

  # Update logistic regression results in JSON
  update_model_json("StepwiseLogisticRegression", logistic_json_data, json_filepath)
} else {
  # Load logistic regression results from JSON
  loaded_data <- fromJSON(json_filepath)$StepwiseLogisticRegression

  # Check if features are correctly loaded
  if (is.null(loaded_data$features) || length(loaded_data$features) == 0) {
    stop("No features found in the loaded model data.")
  }

  # Rebuild the formula from the loaded features
  formula_string <- paste("price ~", paste(loaded_data$features, collapse = " + "))
  fit_logistic_step <- glm(as.formula(formula_string), family = binomial, data = train_df_logis)
  logistic_results <- perform_and_evaluate_logistic_regression(train_df_logis, test_df_logis, fit_logistic_step)
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Prepare results for the dataframe
logistic_model_row <- data.frame(
  Model = "Stepwise Logistic Regression",
  Accuracy_train = logistic_results$accuracy_train,
  Accuracy_test = logistic_results$accuracy_test,
  Precision_train = logistic_results$precision_train,
  Precision_test = logistic_results$precision_test,
  Recall_train = logistic_results$recall_train,
  Recall_test = logistic_results$recall_test,
  F1_train = logistic_results$F1_train,
  F1_test = logistic_results$F1_test,
  Pseudo_R_squared_train = logistic_results$pseudo_r_squared_train,
  stringsAsFactors = FALSE
)

summary(fit_logistic_step)
## 
## Call:
## glm(formula = as.formula(formula_string), family = binomial, 
##     data = train_df_logis)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -1.713e+03  2.547e+02  -6.727 1.73e-11 ***
## bedrooms                -1.209e-01  4.256e-02  -2.840 0.004518 ** 
## bathrooms                3.701e-01  7.179e-02   5.155 2.54e-07 ***
## sqft_living              3.339e-03  1.106e-04  30.182  < 2e-16 ***
## sqft_lot                 1.137e-05  1.626e-06   6.996 2.63e-12 ***
## floors                  -4.016e-01  8.052e-02  -4.987 6.12e-07 ***
## sqft_basement           -1.325e-03  1.167e-04 -11.352  < 2e-16 ***
## yr_renovated             2.777e-04  8.479e-05   3.275 0.001057 ** 
## sqft_living15            1.429e-03  9.824e-05  14.548  < 2e-16 ***
## sqft_lot15               3.752e-06  2.171e-06   1.729 0.083883 .  
## year_sold                8.527e-01  1.264e-01   6.747 1.51e-11 ***
## day_sold                -6.395e-03  3.585e-03  -1.784 0.074436 .  
## day_of_year              1.134e-03  6.048e-04   1.876 0.060690 .  
## zipcode_98001           -2.814e+00  4.071e-01  -6.912 4.77e-12 ***
## zipcode_98002           -1.500e+01  1.592e+02  -0.094 0.924947    
## zipcode_98003           -2.658e+00  4.754e-01  -5.591 2.26e-08 ***
## zipcode_98006           -3.581e+00  2.900e-01 -12.347  < 2e-16 ***
## zipcode_98007           -3.648e+00  3.431e-01 -10.634  < 2e-16 ***
## zipcode_98008           -3.021e+00  2.723e-01 -11.095  < 2e-16 ***
## zipcode_98011           -3.300e+00  2.846e-01 -11.596  < 2e-16 ***
## zipcode_98014           -2.025e+00  5.103e-01  -3.967 7.27e-05 ***
## zipcode_98019           -2.934e+00  3.075e-01  -9.543  < 2e-16 ***
## zipcode_98023           -3.908e+00  4.769e-01  -8.196 2.48e-16 ***
## zipcode_98024           -1.182e+00  5.340e-01  -2.215 0.026793 *  
## zipcode_98027           -1.858e+00  2.329e-01  -7.977 1.50e-15 ***
## zipcode_98028           -4.137e+00  2.464e-01 -16.787  < 2e-16 ***
## zipcode_98030           -4.642e+00  4.519e-01 -10.272  < 2e-16 ***
## zipcode_98031           -6.605e+00  5.541e-01 -11.920  < 2e-16 ***
## zipcode_98032           -8.289e+00  1.414e+00  -5.862 4.59e-09 ***
## zipcode_98033           -2.456e+00  2.651e-01  -9.266  < 2e-16 ***
## zipcode_98034           -3.850e+00  1.961e-01 -19.635  < 2e-16 ***
## zipcode_98038           -3.569e+00  2.874e-01 -12.418  < 2e-16 ***
## zipcode_98042           -4.981e+00  3.124e-01 -15.947  < 2e-16 ***
## zipcode_98045            2.108e+00  3.921e-01   5.377 7.59e-08 ***
## zipcode_98052           -2.592e+00  2.204e-01 -11.763  < 2e-16 ***
## zipcode_98053           -1.312e+00  2.553e-01  -5.139 2.76e-07 ***
## zipcode_98055           -6.513e+00  4.266e-01 -15.268  < 2e-16 ***
## zipcode_98056           -6.195e+00  2.493e-01 -24.848  < 2e-16 ***
## zipcode_98058           -5.951e+00  2.868e-01 -20.754  < 2e-16 ***
## zipcode_98059           -5.657e+00  2.372e-01 -23.849  < 2e-16 ***
## zipcode_98065           -8.357e-01  2.785e-01  -3.000 0.002699 ** 
## zipcode_98070           -1.134e+00  4.162e-01  -2.724 0.006454 ** 
## zipcode_98072           -2.720e+00  2.571e-01 -10.581  < 2e-16 ***
## zipcode_98074           -1.902e+00  2.678e-01  -7.103 1.22e-12 ***
## zipcode_98075           -1.388e+00  4.893e-01  -2.837 0.004560 ** 
## zipcode_98077           -2.824e+00  3.939e-01  -7.170 7.47e-13 ***
## zipcode_98092           -3.750e+00  3.889e-01  -9.642  < 2e-16 ***
## zipcode_98102           -9.040e-01  4.335e-01  -2.086 0.037011 *  
## zipcode_98103           -1.037e+00  1.762e-01  -5.887 3.94e-09 ***
## zipcode_98106           -6.747e+00  4.804e-01 -14.046  < 2e-16 ***
## zipcode_98107            3.780e-01  2.359e-01   1.602 0.109103    
## zipcode_98108           -5.890e+00  3.232e-01 -18.223  < 2e-16 ***
## zipcode_98112           -8.584e-01  4.736e-01  -1.813 0.069882 .  
## zipcode_98115           -1.283e+00  1.947e-01  -6.591 4.37e-11 ***
## zipcode_98116           -6.263e-01  2.203e-01  -2.843 0.004474 ** 
## zipcode_98118           -4.996e+00  2.229e-01 -22.410  < 2e-16 ***
## zipcode_98119            8.554e-01  4.684e-01   1.826 0.067809 .  
## zipcode_98122           -2.610e+00  2.512e-01 -10.388  < 2e-16 ***
## zipcode_98125           -3.219e+00  2.056e-01 -15.652  < 2e-16 ***
## zipcode_98126           -3.008e+00  2.417e-01 -12.447  < 2e-16 ***
## zipcode_98133           -3.324e+00  2.075e-01 -16.023  < 2e-16 ***
## zipcode_98136           -1.468e+00  2.437e-01  -6.026 1.68e-09 ***
## zipcode_98144           -4.557e+00  2.518e-01 -18.100  < 2e-16 ***
## zipcode_98146           -4.043e+00  2.985e-01 -13.545  < 2e-16 ***
## zipcode_98148           -5.578e+00  1.004e+00  -5.555 2.77e-08 ***
## zipcode_98155           -3.893e+00  2.251e-01 -17.294  < 2e-16 ***
## zipcode_98166           -3.584e+00  2.858e-01 -12.541  < 2e-16 ***
## zipcode_98168           -8.621e+00  1.261e+00  -6.839 8.00e-12 ***
## zipcode_98177           -1.816e+00  2.511e-01  -7.232 4.75e-13 ***
## zipcode_98178           -8.189e+00  5.618e-01 -14.578  < 2e-16 ***
## zipcode_98188           -6.048e+00  6.750e-01  -8.959  < 2e-16 ***
## zipcode_98198           -4.661e+00  4.444e-01 -10.491  < 2e-16 ***
## view_0                  -3.401e+00  6.640e-01  -5.122 3.03e-07 ***
## view_1                  -2.351e+00  7.185e-01  -3.272 0.001069 ** 
## view_2                  -2.257e+00  6.743e-01  -3.347 0.000818 ***
## view_3                  -1.563e+00  6.939e-01  -2.253 0.024287 *  
## condition_1             -2.616e+00  7.633e-01  -3.428 0.000609 ***
## condition_2             -1.382e+00  4.626e-01  -2.989 0.002803 ** 
## condition_4              4.548e-01  7.989e-02   5.692 1.25e-08 ***
## condition_5              1.040e+00  1.276e-01   8.150 3.63e-16 ***
## waterfront_0            -2.885e+00  7.217e-01  -3.997 6.41e-05 ***
## season_Spring            5.045e-01  8.377e-02   6.023 1.72e-09 ***
## season_Summer            3.048e-01  8.806e-02   3.462 0.000537 ***
## distance_to_convergence -2.848e-01  1.166e-02 -24.413  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20973.2  on 15128  degrees of freedom
## Residual deviance:  6779.1  on 15045  degrees of freedom
## AIC: 6947.1
## 
## Number of Fisher Scoring iterations: 15

9.3.7 Stepwise Logistic Regression Model Performance

A “Stepwise Logistic Regression” model was applied to predict a binary outcome. The model demonstrated good performance on both the training and test datasets.

9.3.8 Training Dataset Metrics

For the training dataset, the model’s performance metrics were as follows:

  • Accuracy: The accuracy of the model was found to be approximately 90.79%, indicating that it correctly classified a large proportion of the observations.

  • Precision: The precision, which measures the proportion of true positive predictions among all positive predictions, was approximately 89.81%. This suggests that when the model predicted a positive outcome, it was accurate nearly 90% of the time.

  • Recall: The recall, also known as sensitivity, was about 91.13%. This metric assesses the model’s ability to correctly identify true positive cases out of all actual positive cases. In this context, the model effectively captured a significant portion of the true positive cases.

  • F1-Score: The F1-score, a balanced measure of precision and recall, was found to be approximately 90.31%. This score indicates a good balance between the ability to make accurate positive predictions and capturing actual positive cases.

  • Pseudo R-squared: The pseudo R-squared value, which measures the goodness of fit of the logistic regression model, was approximately 90.41%. A higher pseudo R-squared suggests that the model explains a significant portion of the variance in the data.

9.3.9 Test Dataset Metrics

When evaluated on the test dataset, the “Stepwise Logistic Regression” model displayed similar performance trends. The model’s performance metrics on the test dataset were as follows:

  • Accuracy: The accuracy on the test data was approximately 89.45%.

  • Precision: Precision on the test data was about 89.88%.

  • Recall: Recall on the test data was approximately 91.13%.

  • F1-Score: The F1-score on the test dataset was approximately 89.88%, indicating that the model maintained a balanced performance between precision and recall.

  • Pseudo R-squared: The pseudo R-squared value for the test dataset was about 67.68%, suggesting that the model explained a substantial portion of the variance in the test data, although slightly lower than on the training data.


10. Model Evaluation on Test Data

10.1 Neural Network Model Analysis

10.1.1 Results and Interpretation

Model Performance

The Neural Network model demonstrates impressive performance in predicting housing prices in King County, with substantial R-squared values of 0.9404 and 0.8734 for training and testing datasets, respectively. This indicates a high level of predictive accuracy, with the model explaining approximately 94% of the variance in the training dataset and 87% in the testing dataset. Such a high R-squared value in training suggests that the model fits the training data well, though the slight drop in the test data might hint at some overfitting.

The model’s RMSE (Root Mean Squared Error) values provide insight into the average prediction error. With RMSE values of 94,836.04 for training and 150,299.41 for testing, we observe that the model’s predictions are closer to the actual values in the training set compared to the test set. This again might indicate overfitting, where the model is tuned more closely to the training data’s specific characteristics.

The MAE (Mean Absolute Error) of 57,832.30 for training and 83,381.52 for testing reflects the model’s average absolute error in predicting housing prices. The higher MAE in the test set compared to the training set further supports the possibility of overfitting.

10.2 Transformed Model Analysis

10.2.1 Results and Interpretation

Model Performance

The Transformed Model, likely involving some form of data transformation or feature engineering, exhibits strong performance, with R-squared values of 0.87256 and 0.87105 for the training and test datasets, respectively. This consistency in R-squared values between training and testing suggests the model generalizes well without significant overfitting.

The RMSE values, standing at 133,184.41 for training and 200,676.10 for testing, are relatively high, particularly in the test set. This indicates a larger average prediction error, which might be a concern in practical applications where lower errors are desired.

The MAE values of 76,550.98 for training and 81,099.88 for testing further suggest that, on average, the model’s predictions are off by these amounts from the actual prices. The closer values of MAE between training and testing datasets indicate a consistent performance of the model across both datasets.

In summary, the Transformed Model’s consistent performance across training and testing sets is commendable, but its higher error metrics compared to the Neural Network model suggest there might be room for improvement, possibly through more refined feature engineering or model tuning.

10.3 OLS Stepwise Backward Model Analysis

10.3.1 Results and Interpretation

Model Performance

The OLS Stepwise Backward Model, a linear regression model that iteratively removes the least significant variables, shows good R-squared values of 0.81781 for training and 0.81969 for testing. These values indicate that the model explains over 81% of the variance in housing prices in both datasets, showcasing a balanced fit that does not overly favor the training data.

The RMSE values of 154,372.05 for training and 161,155.62 for testing suggest a moderate level of prediction error. While higher than what we observed in the Neural Network model, these values are acceptable for a linear model, which often trades off some predictive power for simplicity and interpretability.

The MAE values of 92,205.44 for training and 93,103.61 for testing are consistent, indicating that the model’s average prediction error remains stable across both datasets. This stability is a positive aspect, as it suggests the model’s performance is not heavily dependent on the specific characteristics of the training data.

Overall, the OLS Stepwise Backward Model is a strong contender in predicting housing prices, offering a good balance between predictive accuracy and model simplicity. Its interpretability, a key advantage of linear models, makes it a valuable tool for understanding the relationships between features and housing prices in King County.

10.4 OLS Linear Model Analysis

10.4.1 Results and Interpretation

Model Performance

The OLS Linear Model displays a commendable fit to the data with R-squared values of 0.81859 for training and 0.81966 for testing. These values suggest that the model explains approximately 82% of the variance in housing prices, both in training and testing datasets, indicating a robust model that generalizes well to new data. The similarity in R-squared values between the training and testing sets is particularly noteworthy as it implies a consistent model performance without overfitting.

Regarding error metrics, the RMSE values of 154,041.04 for training and 161,166.07 for testing suggest a moderate prediction error. While not as low as some more complex models like the Neural Network, these RMSE values are relatively acceptable for a linear regression model. The consistent RMSE between training and testing also underscores the model’s generalizability.

The MAE values, 91,957.68 for training and 93,048.1 for testing, further reinforce this point. They reflect a stable average prediction error across both datasets, which is a positive aspect for model reliability.

The OLS Linear Model stands out for its simplicity and consistency, making it a valuable model for predicting housing prices, especially when interpretability and understanding of linear relationships are crucial.

10.5 OLS Step Forward Model Analysis

10.5.1 Results and Interpretation

Model Performance

The OLS Step Forward Model, which iteratively adds the most significant variables, shows good model performance with R-squared values of 0.8178 for training and 0.81961 for testing. These figures indicate that the model explains over 81% of the variance in the housing prices for both datasets, showcasing a balanced fit and good predictive power.

The RMSE values of 154,375.23 for training and 161,191.81 for testing indicate a moderate level of prediction error. These figures, while slightly higher than the OLS Linear Model, are still reasonable for a regression model. The consistency of RMSE between training and testing suggests that the model is not overfitted to the training data and can generalize well.

The MAE values of 92,195.59 for training and 93,107.59 for testing further confirm the model’s stable predictive performance across the two datasets. The similarity in these values suggests that the model’s average prediction error is consistent, which is advantageous for practical applications.

The OLS Step Forward Model is a robust option for predicting housing prices, providing a good balance between model complexity and performance. Its incremental approach in selecting variables can be particularly useful in identifying the most relevant features affecting housing prices.

10.6 OLS Step Both Model Analysis

10.6.1 Results and Interpretation

Model Performance

The OLS Step Both Model, which considers both adding and removing variables in each step, shows R-squared values of 0.81614 for training and 0.81715 for testing. These values, though slightly lower than the previous OLS models, still indicate that the model explains a substantial portion of the variance in housing prices, with over 81% in both training and testing data.

The model’s RMSE values of 155,079.14 for training and 162,285.16 for testing are somewhat higher than the other OLS models, suggesting a slightly increased average prediction error. However, these values are still within a reasonable range for regression models.

The MAE values of 92,978.7 for training and 94,247.4 for testing are consistent and reflect the model’s stable performance in predicting housing prices across different datasets. This stability is crucial in ensuring the model’s reliability in various scenarios.

The OLS Step Both Model provides a balanced approach to variable selection, offering a good mix of predictability and interpretability. While it may not be the strongest model in terms of minimizing error metrics, its comprehensive approach to feature selection makes it a valuable tool in understanding the factors influencing housing prices in King County.

10.7 OLS Step Both with Outliers Model Analysis

10.7.1 Results and Interpretation

Model Performance

The OLS Step Both with Outliers Model, which presumably includes a stepwise approach while also addressing outliers, shows an R-squared of 0.82176 in training and 0.8167 in testing. This indicates a strong ability to explain the variance in housing prices, slightly better than some other OLS models. The slight decrease in R-squared from training to testing could suggest minor overfitting or variance in the test data not captured by the model.

In terms of error metrics, the RMSE values are 148,908.82 for training and 162,486.45 for testing. The lower RMSE in the training set compared to the testing set might be due to the model being slightly more tuned to the training data, especially in the context of outlier handling. However, these values are relatively lower compared to other models, which is a positive sign.

The MAE values of 91,385.31 for training and 93,114.76 for testing are consistent and indicate a stable average prediction error across both datasets. This stability is a positive indicator of the model’s reliability across different scenarios.

The OLS Step Both with Outliers Model appears to be a strong contender in terms of balancing the trade-off between complexity and prediction accuracy, especially with its focus on outlier handling, which is crucial in real estate data with its often skewed and outlier-prone distributions.

10.8 Regression Tree Model Analysis

10.8.1 Results and Interpretation

Model Performance

The Regression Tree Model, a non-linear approach, shows R-squared values of 0.79117 for training and 0.75897 for testing. These figures, while lower than the linear models, still represent a decent level of predictive accuracy, especially considering the model’s non-linear nature. The drop in R-squared from training to testing suggests some overfitting, typical in tree-based models.

The RMSE values stand at 165,274.84 for training and 186,382.77 for testing, indicating higher average prediction errors compared to the linear models. This could be due to the model capturing complex patterns in the training data that don’t generalize as well to the test set.

The MAE values of 109,635.94 for training and 116,844.59 for testing are the highest among all models discussed so far. This suggests that the average prediction error of the Regression Tree is comparatively larger, which might be a critical factor to consider in its practical application.

While the Regression Tree Model brings the advantage of modeling non-linear relationships, which are often present in real estate data, its tendency towards overfitting and higher error rates compared to linear models may limit its suitability as the primary model for predicting housing prices in King County.

10.9 Weighted Transformed Model 200 Iterations Analysis

10.9.1 Results and Interpretation

Model Performance

The Weighted Transformed Model 200 Iterations, potentially a more complex or specialized model, shows R-squared values of 0.75884 for training and 0.74016 for testing. These values are comparatively lower than both the linear and the Regression Tree models, suggesting a lesser ability to explain the variance in housing prices. The consistency between training and testing R-squared values, however, does indicate a level of robustness in the model’s predictive ability across different datasets.

The RMSE values are significantly higher, with 642,962.63 for training and 662,703.16 for testing, indicating very high average prediction errors. This might raise concerns about the model’s practical application, as such high errors could significantly impact decision-making in real estate markets.

Similarly, the MAE values of 537,582.19 for training and 543,270.39 for testing are extremely high, further emphasizing the model’s limited accuracy in predicting housing prices. This could be a result of the model’s complexity, potential overfitting, or it not being well-tuned to this specific dataset.

The Weighted Transformed Model 200 Iterations, despite its potential complexity and sophistication, does not seem to perform as well as the other models in predicting housing prices in King County. Its high error metrics and lower R-squared values suggest that it might not be the best choice for this particular application, especially when accuracy and reliability are paramount.


11. Primary and Benchmark Models

11.1 Overall Model Performances

Given the comprehensive analysis of various models for predicting housing prices in King County, the selection of the Neural Network model as the primary choice, with the Transformed Model as a backup, is based on a careful consideration of several key performance metrics.

11.2 Primary Choice: Neural Network Model

  1. High R-squared Values: The Neural Network model exhibited the highest R-squared values (0.9404 for training and 0.8734 for testing), indicating superior predictive accuracy. This model explained a significant portion of the variance in housing prices, more so than any other model tested. High R-squared in training shows the model’s ability to capture complex relationships in the data, while the relatively high value in testing indicates good generalization capabilities.

  2. Balanced RMSE and MAE: Although the RMSE and MAE values for the Neural Network model were higher in the testing set compared to the training set (suggesting some overfitting), these values were still among the lowest across all models. The RMSE of 150,299.41 and MAE of 83,381.52 in the testing set are indicative of relatively accurate predictions. Lower error metrics are crucial in real estate applications where large prediction errors can be costly.

  3. Model’s Flexibility and Complexity: Neural Networks are inherently more flexible and capable of modeling complex non-linear relationships that are often present in real estate data. This makes them well-suited for capturing intricate patterns and interactions among variables that linear models might miss.

11.3 Backup Choice: Transformed Model

  1. Consistent R-squared Across Datasets: The Transformed Model showed consistent R-squared values in both training (0.87256) and testing (0.87105), which were second only to the Neural Network model. This consistency is a strong indicator of the model’s ability to generalize well, avoiding the pitfall of overfitting.

  2. Reasonable RMSE and MAE: While the RMSE and MAE values of the Transformed Model were higher compared to the Neural Network, especially in the testing set (RMSE of 200,676.10 and MAE of 81,099.88), they were still within a reasonable range. These figures suggest that the model is capable of making relatively accurate predictions, albeit not as precise as the Neural Network.

  3. Potential for Robustness in Feature Handling: The name ‘Transformed Model’ suggests that it involves some form of data transformation or feature engineering. This could imply an effective handling of outliers or skewed distributions, which are common in housing data. Such transformations can make the model robust to anomalies in the data, which is a valuable quality in real estate market analysis.

11.4 Comparative Analysis

  1. Overfitting Concerns: While the Neural Network showed some signs of overfitting (as evidenced by the drop in R-squared from training to testing), its overall performance metrics still outperformed other models. The Transformed Model, with its consistent performance across training and testing, serves as a good backup to mitigate any overfitting risks.

  2. Error Metrics Consideration: In real estate price prediction, minimizing prediction errors is crucial due to the high stakes involved. Both the Neural Network and the Transformed Model presented reasonable error metrics, making them preferable for such applications.

  3. Complexity vs. Interpretability Trade-off: While simpler models like OLS Linear or OLS Stepwise models offer greater interpretability, they fell short in capturing the complexity of the housing market data as effectively as the Neural Network. The Transformed Model strikes a balance between complexity and interpretability, making it a suitable backup.

The Neural Network model’s superior ability to capture complex patterns and its lower error metrics make it the best choice for predicting housing prices in King County. The Transformed Model, with its consistent performance and potential robustness, serves as an excellent backup, especially in scenarios where overfitting or complexity of the primary model might be a concern.

11.3 Benchmark Model (Lasso) Justification

In choosing the Lasso Model as the benchmark for predicting housing prices in King County, several key factors have been considered. This analysis assesses the Lasso Model’s performance with regard to linearity, normality, multicollinearity, outliers, and heteroscedasticity. The examination primarily focuses on the model’s coefficients and their differences (total_diff) in comparison to other robust models (Huber, Ridge, Elastic Net, and Lasso).

11.4 Analysis of Coefficients

  1. Total Difference (total_diff): The Lasso Model demonstrates a relatively low total_diff value compared to alternative models. This metric serves as a significant indicator of the model’s overall stability and consistency in estimating coefficients across different modeling techniques.

  2. Multicollinearity Control: Lasso inherently addresses multicollinearity by shrinking certain coefficients to zero. This feature selection technique enhances model interpretability, reduces overfitting, and effectively deals with highly correlated predictors.

  3. Outliers and Heteroscedasticity Management: Lasso’s regularization method not only facilitates feature selection but also mitigates sensitivity to outliers. This is evident in the stability of coefficients associated with variables such as ‘lat,’ ‘long,’ ‘sqft_living,’ and ‘grade,’ which are critical in housing price predictions. By penalizing the absolute magnitude of coefficients, Lasso diminishes the influence of outliers and aids in handling heteroscedasticity.

  4. Handling Complex Feature Spaces: The housing market dataset often includes a mix of categorical (e.g., zipcode, condition, grade) and continuous variables (e.g., sqft_living, bedrooms). Lasso adeptly manages this complexity by selecting the most relevant features and mitigating overfitting, as indicated by its coefficients across various predictors.

  5. Comparison with Other Models: In comparison to models such as Ridge or Elastic Net, Lasso strikes a balance between coefficient shrinkage and feature selection. Unlike Ridge, which shrinks coefficients but retains all variables, Lasso’s approach of setting some coefficients to zero simplifies the model and enhances interpretability.

  6. Coefficient Interpretation and Real Estate Insights: The Lasso Model’s coefficients align with intuitive real estate dynamics. For instance, positive coefficients for ‘sqft_living,’ ‘bathrooms,’ and ‘grade_Above_Average’ correspond to the understanding that larger, higher-graded houses with more amenities generally command higher prices. Negative coefficients for certain zip codes may suggest less desirable areas, a common insight in real estate analysis.

  7. Consideration of Linearity and Normality: While Lasso does not directly address linearity or normality assumptions, its reduction of multicollinearity indirectly contributes to a more stable and reliable linear regression framework.

11.5 Conclusion

The Lasso Model’s effective control of multicollinearity, handling of outliers, and balanced approach to feature selection and regularization make it a well-suited benchmark model for predicting housing prices. Its lower total_diff value and the interpretability of its coefficients, especially within the context of real estate data, further support its appropriateness for this application. This approach, while robust, maintains a crucial level of simplicity and interpretability that is essential for practical real estate market analyses.